
github.com/microsoft/fara @main sqlite


789 symbols 3,040 edges 87 files 305 documented · 39%

README

Fara-7B: An Efficient Agentic Model for Computer Use

Fara-7B Performance

Updates

2026-05-21 - Fara1.5 agent harness coming soon!
2026-05-12 — Refreshed WebTailBench (V2) tasks and rubrics. Many V1 tasks had calendar-bound dates that expired (Nov 2025); V2 rolls those forward and revises the precomputed rubrics for the full 609-task suite. Available now as the test_v2 split on microsoft/WebTailBench. A side-by-side V1↔V2 diff (task strings and rubric JSON) is hosted here.
2026-04-19 — Released CUAVerifierBench, a human-annotated benchmark for evaluating CUA verifiers (i.e. judges that score agent trajectories). Two splits — fara7b_om2w_browserbase (106 Fara-7B Online-Mind2Web/Browserbase trajectories, ~2 reviewers each) and internal (154 trajectories from a heldout aurora-v2 task suite) — with per-judge UV-blind / UV-informed labels, Universal Verifier outputs, and legacy verifier outputs side-by-side. The build script that produced the dataset lives alongside the data on HuggingFace.
2026-04-18 - Removed the autogen-core / autogen-ext dependency from webeval; chat completion clients are now self-contained under webeval/src/webeval/oai_clients/. There is no longer an autogen submodule install step: run `pip install -e .[vllm]` from the repository root, then `cd webeval` and run `pip install -e .` there.
2026-04-18 — Incorporated WebTailBench (initial / now-stale version) directly into the repo as a first-class benchmark. The loader auto-downloads WebTailBench-v1-rubrics.tsv from microsoft/WebTailBench and threads each task's published precomputed_rubric through to the verifier. Reproducibility CLI lives in webeval/scripts/webtailbench.py.
2026-04-18 — Released the Universal Verifier (MMRubricAgent) as the official verifier for WebTailBench. Multimodal, rubric-grounded, two-model ensemble (gpt-5.2 + o4-mini) with per-criterion scoring, outcome verification, and first-point-of-failure analysis. A stand-alone parallel runner is at webeval/scripts/verify_trajectories.py for re-scoring any directory of webeval trajectories without touching the solver.

Overview

Fara-7B is Microsoft's first agentic small language model (SLM) designed specifically for computer use. With only 7 billion parameters, Fara-7B is an ultra-compact Computer Use Agent (CUA) that achieves state-of-the-art performance within its size class and is competitive with larger, more resource-intensive agentic systems.

Try Fara-7B locally as follows (see Installation for detailed instructions on Windows) or via Magentic-UI:

# 1. Clone repository
git clone https://github.com/microsoft/fara.git
cd fara

# 2. Setup environment
python3 -m venv .venv 
source .venv/bin/activate
pip install -e .
playwright install

Then in one process, host the model:

vllm serve "microsoft/Fara-7B" --port 5000 --dtype auto
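Loading a 7B model can take a while, so it may help to confirm the server is reachable before sending tasks. The sketch below assumes vLLM's OpenAI-compatible server is listening on localhost at the port used above and polls its `/v1/models` endpoint; the helper names `models_url` and `served_models` are illustrative and not part of this repository.

```python
import json
import urllib.error
import urllib.request


def models_url(host: str = "localhost", port: int = 5000) -> str:
    """Build the URL of the OpenAI-compatible model listing endpoint."""
    return f"http://{host}:{port}/v1/models"


def served_models(host: str = "localhost", port: int = 5000,
                  timeout: float = 2.0) -> list:
    """Return the model ids the server reports, or [] if it is unreachable."""
    try:
        with urllib.request.urlopen(models_url(host, port), timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply: treat as "not ready".
        return []
    return [entry["id"] for entry in payload.get("data", [])]


if __name__ == "__main__":
    ids = served_models()
    print("server ready:" if ids else "server not reachable yet", ids)
```

If the list comes back empty, the server is most likely still loading weights or is bound to a different port.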

Then you can iteratively query it with:

fara-cli --task "whats the weather in new york now"

To try Fara-7B inside Magentic-UI, follow the Magentic-UI + Fara-7B instructions. You still need to serve the model as described above, but instead of fara-cli you interact with it through Magentic-UI's graphical interface (see the video demos below).

Notes:
- If you're using Windows, we highly recommend using WSL2 (Windows Subsystem for Linux). Please see the Windows instructions in the Installation section.
- If you run out of GPU memory, you may need to add --tensor-parallel-size 2 to the vllm serve command to shard the model across two GPUs.

**Shopping**

**GitHub Issues**

**Directions with Cheese**

What Makes Fara-7B Unique

Unlike traditional chat models that generate text-based responses, Fara-7B leverages computer interfaces—mouse and keyboard—to perform multi-step tasks on behalf of users. The model:

- Operates visually by perceiving webpages and taking actions like scrolling, typing, and clicking on directly predicted coordinates, without accessibility trees or separate parsing models
- Enables on-device deployment due to its compact 7B parameter size, resulting in reduced latency and improved privacy as user data remains local
- Completes tasks efficiently, averaging only ~16 steps per task compared to ~41 for comparable models
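To make "clicking on directly predicted coordinates" concrete, here is a minimal sketch of how a point predicted in screenshot pixel space could be mapped onto the browser viewport before a click. The function `to_viewport` and the premise that the model sees a resized screenshot are assumptions made for illustration; they are not taken from the Fara codebase. The commented-out call at the end is Playwright's standard mouse API.

```python
def to_viewport(x, y, shot_size, viewport_size):
    """Map a point from screenshot pixels to viewport pixels.

    shot_size and viewport_size are (width, height) tuples. The result is
    clamped so that it always lands inside the viewport.
    """
    shot_w, shot_h = shot_size
    view_w, view_h = viewport_size
    vx = x * view_w / shot_w
    vy = y * view_h / shot_h
    vx = min(max(vx, 0.0), view_w - 1)
    vy = min(max(vy, 0.0), view_h - 1)
    return vx, vy


# Hypothetical model output: a click at (500, 250) on a 1000x500 screenshot,
# replayed in a 1280x720 viewport.
vx, vy = to_viewport(500, 250, (1000, 500), (1280, 720))
print(vx, vy)

# With a live Playwright page this would become:
# page.mouse.click(vx, vy)
```

No accessibility tree or DOM query is involved: the only inputs are the pixel coordinates and the two image sizes.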

Fara-7B is trained using a novel synthetic data generation pipeline built on the Magentic-One multi-agent framework, with 145K trajectories covering diverse websites, task types, and difficulty levels. The model is based on Qwen2.5-VL-7B and trained with supervised fine-tuning.

Key Capabilities

Fara-7B can automate everyday web tasks including:
- Searching for information and summarizing results
- Filling out forms and managing accounts
- Booking travel, movie tickets, and restaurant reservations
- Shopping and comparing prices across retailers
- Finding job postings and real estate listings

Performance Highlights

Fara-7B achieves state-of-the-art results for its size class across multiple web agent benchmarks. It outperforms UI-TARS-1.5-7B on all four benchmarks below and beats several larger systems on some of them:

| Model | Params | WebVoyager | Online-M2W | DeepShop | WebTailBench |
|---|---|---|---|---|---|
| **SoM Agents** | | | | | |
| SoM Agent (GPT-4o-0513) | - | 90.6 | 57.7 | 49.1 | 60.4 |
| SoM Agent (o3-mini) | - | 79.3 | 55.4 | 49.7 | 52.7 |
| SoM Agent (GPT-4o) | - | 65.1 | 34.6 | 16.0 | 30.8 |
| GLM-4.1V-9B-Thinking | 9B | 66.8 | 33.9 | 32.0 | 22.4 |
| **Computer Use Models** | | | | | |
| OpenAI computer-use-preview | - | 70.9 | 42.9 | 24.7 | 25.7 |
| UI-TARS-1.5-7B | 7B | 66.4 | 31.3 | 11.6 | 19.5 |
| Fara-7B | 7B | 73.5 | 34.1 | 26.2 | 38.4 |

Table: Online agent evaluation results showing success rates (%) across four web benchmarks. Results are averaged over 3 runs.

WebTailBench: A New Benchmark for Real-World Web Tasks

We are releasing WebTailBench, a new evaluation benchmark focusing on 11 real-world task types that are underrepresented or missing in existing benchmarks. The benchmark includes 609 tasks across diverse categories, with the first 8 segments testing single skills or objectives (usually on a single website), and the remaining 3 evaluating more difficult multi-step or cross-site tasks.

WebTailBench Detailed Results

| Task Segment | Tasks | SoM GPT-4o-0513 | SoM o3-mini | SoM GPT-4o | GLM-4.1V-9B | OAI Comp-Use | UI-TARS-1.5 | Fara-7B |
|---|---|---|---|---|---|---|---|---|
| **Single-Site Tasks** | | | | | | | | |
| Shopping | 56 | 62.5 | 71.4 | 38.1 | 31.0 | 42.3 | 41.1 | 52.4 |
| Flights | 51 | 60.1 | 39.2 | 11.1 | 10.5 | 17.6 | 10.5 | 37.9 |
| Hotels | 52 | 68.6 | 56.4 | 31.4 | 19.9 | 26.9 | 35.3 | 53.8 |
| Restaurants | 52 | 67.9 | 59.6 | 47.4 | 32.1 | 35.9 | 22.4 | 47.4 |
| Activities | 80 | 70.4 | 62.9 | 41.7 | 26.3 | 30.4 | 9.6 | 36.3 |
| Ticketing | 57 | 58.5 | 56.7 | 37.4 | 35.7 | 49.7 | 30.4 | 38.6 |
| Real Estate | 48 | 34.0 | 17.4 | 20.1 | 16.0 | 9.0 | 9.7 | 23.6 |
| Jobs/Careers | 50 | 49.3 | 44.0 | 32.7 | 22.7 | 20.7 | 20.7 | 28.0 |
| **Multi-Step Tasks** | | | | | | | | |
| Shopping List (2 items) | 51 | 66.0 | 62.7 | 17.0 | 7.8 | 34.0 | 20.9 | 49.0 |
| Comparison Shopping | 57 | 67.3 | 59.1 | 27.5 | 22.8 | 1.2 | 8.8 | 32.7 |
| Compositional Tasks | 55 | 51.5 | 39.4 | 26.7 | 17.0 | 10.3 | 9.1 | 23.0 |
| **Overall** | | | | | | | | |
| Macro Average | 609 | 59.7 | 51.7 | 30.1 | 22.0 | 25.3 | 19.9 | 38.4 |
| Micro Average | 609 | 60.4 | 52.7 | 30.8 | 22.4 | 25.7 | 19.5 | 38.4 |

Table: Breakdown of WebTailBench results across all 11 segments. Success rates (%) are averaged over 3 independent runs. Among the computer-use models, Fara-7B achieves the highest success rate in every segment except Ticketing, where OAI Comp-Use scores 49.7 to Fara-7B's 38.6.
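The two "Overall" rows differ in how segments are weighted: the macro average treats every segment equally, while the micro average weights each segment by its task count. The sketch below recomputes both for the Fara-7B column from the per-segment numbers in the table. Because it starts from rounded segment rates rather than per-task outcomes, it should be read as an approximate cross-check, not as the authors' scoring code.

```python
# (segment, task count, Fara-7B success rate in %) copied from the table above
SEGMENTS = [
    ("Shopping", 56, 52.4),
    ("Flights", 51, 37.9),
    ("Hotels", 52, 53.8),
    ("Restaurants", 52, 47.4),
    ("Activities", 80, 36.3),
    ("Ticketing", 57, 38.6),
    ("Real Estate", 48, 23.6),
    ("Jobs/Careers", 50, 28.0),
    ("Shopping List (2 items)", 51, 49.0),
    ("Comparison Shopping", 57, 32.7),
    ("Compositional Tasks", 55, 23.0),
]


def macro_average(rows):
    """Unweighted mean of the per-segment success rates."""
    return sum(rate for _, _, rate in rows) / len(rows)


def micro_average(rows):
    """Mean of the per-segment rates weighted by the number of tasks."""
    total_tasks = sum(count for _, count, _ in rows)
    return sum(count * rate for _, count, rate in rows) / total_tasks


print("tasks:", sum(count for _, count, _ in SEGMENTS))
print("macro:", round(macro_average(SEGMENTS), 1))
print("micro:", round(micro_average(SEGMENTS), 1))
```

For Fara-7B both values round to 38.4, matching the table; for models whose performance varies more with segment size (for example UI-TARS-1.5, 19.9 vs. 19.5), the two averages diverge.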

Coming Soon:
- Task Verification pipeline for LLM-as-a-judge evaluation
- Official human annotations of WebTailBench (in partnership with BrowserBase)

CUAVerifierBench: Evaluating the Verifiers Themselves

While WebTailBench measures agents, CUAVerifierBench measures the judges that score those agents. Each row pairs a Fara-7B agent trajectory (instruction, screenshots, web_surfer log, final answer) with one human reviewer's verdict, plus the verdicts produced by the Universal Verifier (MMRubricAgent) and several legacy verifiers — so researchers can compute verifier–human agreement (Cohen's κ, accuracy, F1) on a fixed corpus and iterate on new judge prompts / architectures against a frozen ground-truth set.
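As a rough illustration of the agreement metrics mentioned above, the following sketch computes accuracy, F1, and Cohen's κ for a pair of binary label lists. The toy labels are made up for the example, and how the dataset encodes human and verifier verdicts is not assumed here; map the real columns to 0/1 lists before reusing these helpers.

```python
from collections import Counter


def accuracy(human, verifier):
    """Fraction of items on which the verifier matches the human label."""
    return sum(h == v for h, v in zip(human, verifier)) / len(human)


def f1_score(human, verifier, positive=1):
    """F1 of the verifier against human labels for the positive class."""
    tp = sum(h == positive and v == positive for h, v in zip(human, verifier))
    fp = sum(h != positive and v == positive for h, v in zip(human, verifier))
    fn = sum(h == positive and v != positive for h, v in zip(human, verifier))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cohens_kappa(human, verifier):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(human)
    p_o = accuracy(human, verifier)
    h_counts, v_counts = Counter(human), Counter(verifier)
    labels = set(h_counts) | set(v_counts)
    p_e = sum(h_counts[label] * v_counts[label] for label in labels) / (n * n)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)


# Toy labels: 1 = task judged successful, 0 = judged failed.
human = [1, 1, 0, 0, 1, 0, 1, 0]
verifier = [1, 0, 0, 0, 1, 1, 1, 0]
print(accuracy(human, verifier), f1_score(human, verifier), cohens_kappa(human, verifier))
```

κ is lower than raw accuracy here because half of the observed agreement would be expected by chance given the balanced label distribution.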

The dataset is exposed as two HuggingFace configs joinable on task_id:

| Config | Granularity | Contents |
|---|---|---|
| `trajectories` | one row per task | instruction, screenshots, web_surfer log, verifier outputs, task-level human aggregates |
| `annotations` | one row per (task, judge) | per-reviewer outcome / process labels and free-text justifications |

Two splits ship today:

| Split | Source | Trajectories | Annotation rows |
|---|---|---|---|
| `fara7b_om2w_browserbase` | Fara-7B trajectories on Online-Mind2Web tasks executed via Browserbase | 106 | 215 (≈2 reviewers/task; UV-blind and UV-informed stages) |
| `internal` | Heldout aurora-v2 task suite scored with the same WebSurfer + verifier stack | 154 | 154 (1 reviewer/task; UV-blind only) |

Reviewer identities are anonymized as Judge1 … JudgeN using a single shared map across both splits. The build script that produced the dataset (with full schema + provenance) ships alongside the data on HuggingFace at microsoft/CUAVerifierBench; see the dataset README for the full column list.

from datasets import load_dataset

trajs = load_dataset("microsoft/CUAVerifierBench", "trajectories",
                     split="fara7b_om2w_browserbase")
anns  = load_dataset("microsoft/CUAVerifierBench", "annotations",
                     split="fara7b_om2w_browserbase")
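Since the two configs are joinable on task_id, one simple way to work with them is to attach each task's annotation rows to its trajectory row. The sketch below does this over plain dictionaries so it runs without downloading anything; apart from task_id, the column names in the toy rows (`instruction`, `judge`, `outcome`) are placeholders rather than the dataset's documented schema.

```python
from collections import defaultdict


def join_on_task_id(trajectories, annotations):
    """Attach the list of annotation rows to each trajectory row."""
    by_task = defaultdict(list)
    for ann in annotations:
        by_task[ann["task_id"]].append(ann)
    return [
        {**traj, "annotations": by_task.get(traj["task_id"], [])}
        for traj in trajectories
    ]


# Toy rows standing in for the two configs (placeholder columns).
toy_trajs = [
    {"task_id": "t1", "instruction": "find a hotel"},
    {"task_id": "t2", "instruction": "book a table"},
]
toy_anns = [
    {"task_id": "t1", "judge": "Judge1", "outcome": 1},
    {"task_id": "t1", "judge": "Judge2", "outcome": 0},
    {"task_id": "t2", "judge": "Judge1", "outcome": 1},
]

joined = join_on_task_id(toy_trajs, toy_anns)
for row in joined:
    print(row["task_id"], len(row["annotations"]))
```

Iterating over a loaded HuggingFace dataset also yields one dictionary per row, so the same helper should accept the `trajs` and `anns` objects directly.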

Evaluation Infrastructure

Our evaluation setup leverages:

- Playwright: a cross-browser automation framework that replicates browser environments
- Abstract Web Agent Interface: allows integration of any model from any source into the evaluation environment
- Fara-Agent Class: reference implementation for running the Fara model
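To illustrate what a model-agnostic agent interface can look like, here is a small hypothetical sketch. The names `Action`, `WebAgent`, `AlwaysStopAgent`, and `run_episode` are invented for this example and do not reflect the actual classes in this repository; the point is only that an evaluation loop can depend on a single abstract method rather than on any particular model.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Action:
    """One browser action; the field names here are illustrative."""
    kind: str  # e.g. "click", "type", "scroll", "stop"
    x: int = 0
    y: int = 0
    text: str = ""


class WebAgent(ABC):
    """Hypothetical interface: task and screenshot in, next action out."""

    @abstractmethod
    def next_action(self, task: str, screenshot: bytes) -> Action:
        ...


class AlwaysStopAgent(WebAgent):
    """Trivial stand-in model, useful for smoke-testing a harness."""

    def next_action(self, task: str, screenshot: bytes) -> Action:
        return Action(kind="stop", text="no-op agent")


def run_episode(agent: WebAgent, task: str, screenshots: list,
                max_steps: int = 16) -> list:
    """Feed screenshots to the agent until it stops or the budget is spent."""
    history = []
    for shot in screenshots[:max_steps]:
        action = agent.next_action(task, shot)
        history.append(action)
        if action.kind == "stop":
            break
    return history


print(run_episode(AlwaysStopAgent(), "demo task", [b"", b""]))
```

Any model, local or hosted, could be plugged into such a loop by subclassing the interface and translating its output into `Action` objects.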

Note: Fara-7B is an experimental release designed to invite hands-on exploration and feedback from the community. We recommend running it in a sandboxed environment, monitoring its execution, and avoiding sensitive data or high-risk domains.

Installation

Linux

The following instructions are for Linux systems; see the Windows section below for Windows instructions.

Install the package using pip and set up the environment with Playwright:

# 1. Clone repository
git clone https://github.com/microsoft/fara.git
cd fara

# 2. Setup environment
python3 -m venv .venv 
source .venv/bin/activate
pip install -e .[vllm]
playwright install

Note: If you plan to host the model with Azure Foundry only, you can skip the [vllm] extra and run pip install -e . instead.

Windows

For Windows, we highly recommend using WSL2 (Windows Subsystem for Linux) to provide a Linux-like environment. However, if you prefer to run natively on Windows, follow these steps:

# 1. Clone repository
git clone https://github.com/microsoft/fara.git
cd fara

# 2. Setup environment
python3 -m venv .venv
.venv\Scripts\activate
pip install -e .
python3 -m playwright install

Hosting the Model

Recommended: The easiest way to get started is using Azure Foundry hosting, which requires no GPU hardware or model downloads. Alternatively, you can self-host with vLLM if you have GPU resources available.

Azure Foundry Hosting (Recommended)

Deploy Fara-7B on Azure Foundry without

Core symbols most depended-on inside this repo

| Symbol | Called by | File |
|---|---|---|
| get | 466 | src/fara/qwen_helpers/schema.py |
| join | 124 | webeval/src/webeval/core.py |
| items | 47 | webeval/src/webeval/rubric_agent/_cp_schema.py |
| sleep | 32 | src/fara/browser/playwright_controller.py |
| call_llm | 18 | webeval/src/webeval/rubric_agent/formatting.py |
| log_metric | 17 | webeval/scripts/mlflow_rate_limiter.py |
| keys | 15 | webeval/src/webeval/rubric_agent/_cp_schema.py |
| next_client | 14 | webeval/src/webeval/oai_clients/graceful_client.py |

Shape

- Method: 375
- Function: 275
- Class: 122
- Route: 17

Languages

- Python: 97%
- TypeScript: 3%

Modules by API surface

- webeval/src/webeval/rubric_agent/data_point.py: 53 symbols
- webeval/src/webeval/rubric_agent/mm_rubric_agent.py: 51 symbols
- webeval/src/webeval/oai_clients/wrapper.py: 42 symbols
- src/fara/browser/playwright_controller.py: 34 symbols
- webeval/src/webeval/evaluators.py: 27 symbols
- webeval/src/webeval/core.py: 21 symbols
- src/fara/qwen_helpers/schema.py: 21 symbols
- src/fara/fara_agent.py: 21 symbols
- src/fara/browser/page_script.js: 21 symbols
- webeval/src/webeval/rubric_agent/verifier_agent.py: 20 symbols
- webeval/tests/test_oai_clients.py: 19 symbols
- src/fara/browser/browser_bb.py: 19 symbols

Dependencies from manifests, versioned

- Pillow 11.1.0 · 1×
- azure-identity · 1×
- backoff 2.2.1 · 1×
- browserbase · 1×
- docker · 1×
- huggingface_hub · 1×
- imagehash · 1×
- jinja2 · 1×
- joblib · 1×
- jsonschema · 1×
- mlflow · 1×
- nest_asyncio · 1×

For agents

$ claude mcp add fara \
  -- python -m otcore.mcp_server <graph>
