
- test_v2 split on microsoft/WebTailBench. A side-by-side V1↔V2 diff (task strings and rubric JSON) is hosted here.
- fara7b_om2w_browserbase (106 Fara-7B Online-Mind2Web/Browserbase trajectories, ~2 reviewers each) and internal (154 trajectories from a heldout aurora-v2 task suite), with per-judge UV-blind / UV-informed labels, Universal Verifier outputs, and legacy verifier outputs side-by-side. The build script that produced the dataset lives alongside the data on HuggingFace.
- autogen-core / autogen-ext dependency from webeval; chat completion clients are now self-contained under webeval/src/webeval/oai_clients/. No more autogen submodule install step; just pip install -e .[vllm] then cd webeval; pip install -e .
- WebTailBench-v1-rubrics.tsv from microsoft/WebTailBench and threads each task's published precomputed_rubric through to the verifier. Reproducibility CLI lives in webeval/scripts/webtailbench.py.
- MMRubricAgent) as the official verifier for WebTailBench. Multimodal, rubric-grounded, two-model ensemble (gpt-5.2 + o4-mini) with per-criterion scoring, outcome verification, and first-point-of-failure analysis. A stand-alone parallel runner is at webeval/scripts/verify_trajectories.py for re-scoring any directory of webeval trajectories without touching the solver.
Fara-7B is Microsoft's first agentic small language model (SLM) designed specifically for computer use. With only 7 billion parameters, Fara-7B is an ultra-compact Computer Use Agent (CUA) that achieves state-of-the-art performance within its size class and is competitive with larger, more resource-intensive agentic systems.
Try Fara-7B locally as follows (see Installation for detailed instructions on Windows) or via Magentic-UI:
# 1. Clone repository
git clone https://github.com/microsoft/fara.git
cd fara
# 2. Setup environment
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
playwright install
Then in one process, host the model:
vllm serve "microsoft/Fara-7B" --port 5000 --dtype auto
Then you can iteratively query it with:
fara-cli --task "whats the weather in new york now"
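Before running fara-cli, it can help to confirm that the model server is reachable. The sketch below assumes the vllm serve command above is running on localhost port 5000 and relies on vLLM's OpenAI-compatible /v1/models route; the helper names models_url and list_served_models are illustrative and are not part of the fara package.

```python
import json
import urllib.error
import urllib.request


def models_url(host="localhost", port=5000):
    """Build the model-listing URL of an OpenAI-compatible server."""
    return f"http://{host}:{port}/v1/models"


def list_served_models(host="localhost", port=5000, timeout=5.0):
    """Return the ids of the served models, or [] if the server is unreachable.

    Host and port are assumptions taken from the quick-start command above.
    """
    try:
        with urllib.request.urlopen(models_url(host, port), timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts (URLError subclasses it);
        # ValueError covers a response body that is not valid JSON.
        return []
    return [model["id"] for model in payload.get("data", [])]
```

Calling list_served_models() after the server has finished loading should return a list that includes the served model name; an empty list suggests the server is still starting or is listening on a different port.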
To try Fara-7B inside Magentic-UI, follow the Magentic-UI + Fara-7B instructions. You will need to serve the model as before, but instead of fara-cli you use Magentic-UI, which provides a graphical interface (see video demos below).
Notes:
- If you're using Windows, we highly recommend using WSL2 (Windows Subsystem for Linux). Please see the Windows instructions in the Installation section.
- If you run out of memory, you may need to add --tensor-parallel-size 2 to the vllm command.
Video demos: Shopping, GitHub Issues, Directions with Cheese.
Unlike traditional chat models that generate text-based responses, Fara-7B leverages computer interfaces—mouse and keyboard—to perform multi-step tasks on behalf of users. The model:
Fara-7B is trained using a novel synthetic data generation pipeline built on the Magentic-One multi-agent framework, with 145K trajectories covering diverse websites, task types, and difficulty levels. The model is based on Qwen2.5-VL-7B and trained with supervised fine-tuning.
Fara-7B can automate everyday web tasks including:
- Searching for information and summarizing results
- Filling out forms and managing accounts
- Booking travel, movie tickets, and restaurant reservations
- Shopping and comparing prices across retailers
- Finding job postings and real estate listings
Fara-7B achieves state-of-the-art results within its size class across multiple web agent benchmarks and is competitive with several larger systems:
| Model | Params | WebVoyager | Online-M2W | DeepShop | WebTailBench |
|---|---|---|---|---|---|
| SoM Agents | |||||
| SoM Agent (GPT-4o-0513) | - | 90.6 | 57.7 | 49.1 | 60.4 |
| SoM Agent (o3-mini) | - | 79.3 | 55.4 | 49.7 | 52.7 |
| SoM Agent (GPT-4o) | - | 65.1 | 34.6 | 16.0 | 30.8 |
| GLM-4.1V-9B-Thinking | 9B | 66.8 | 33.9 | 32.0 | 22.4 |
| Computer Use Models | |||||
| OpenAI computer-use-preview | - | 70.9 | 42.9 | 24.7 | 25.7 |
| UI-TARS-1.5-7B | 7B | 66.4 | 31.3 | 11.6 | 19.5 |
| Fara-7B | 7B | 73.5 | 34.1 | 26.2 | 38.4 |
Table: Online agent evaluation results showing success rates (%) across four web benchmarks. Results are averaged over 3 runs.
We are releasing WebTailBench, a new evaluation benchmark focusing on 11 real-world task types that are underrepresented or missing in existing benchmarks. The benchmark includes 609 tasks across diverse categories, with the first 8 segments testing single skills or objectives (usually on a single website), and the remaining 3 evaluating more difficult multi-step or cross-site tasks.
| Task Segment | Tasks | SoM GPT-4o-0513 | SoM o3-mini | SoM GPT-4o | GLM-4.1V-9B | OAI Comp-Use | UI-TARS-1.5 | Fara-7B |
|---|---|---|---|---|---|---|---|---|
| Single-Site Tasks | ||||||||
| Shopping | 56 | 62.5 | 71.4 | 38.1 | 31.0 | 42.3 | 41.1 | 52.4 |
| Flights | 51 | 60.1 | 39.2 | 11.1 | 10.5 | 17.6 | 10.5 | 37.9 |
| Hotels | 52 | 68.6 | 56.4 | 31.4 | 19.9 | 26.9 | 35.3 | 53.8 |
| Restaurants | 52 | 67.9 | 59.6 | 47.4 | 32.1 | 35.9 | 22.4 | 47.4 |
| Activities | 80 | 70.4 | 62.9 | 41.7 | 26.3 | 30.4 | 9.6 | 36.3 |
| Ticketing | 57 | 58.5 | 56.7 | 37.4 | 35.7 | 49.7 | 30.4 | 38.6 |
| Real Estate | 48 | 34.0 | 17.4 | 20.1 | 16.0 | 9.0 | 9.7 | 23.6 |
| Jobs/Careers | 50 | 49.3 | 44.0 | 32.7 | 22.7 | 20.7 | 20.7 | 28.0 |
| Multi-Step Tasks | ||||||||
| Shopping List (2 items) | 51 | 66.0 | 62.7 | 17.0 | 7.8 | 34.0 | 20.9 | 49.0 |
| Comparison Shopping | 57 | 67.3 | 59.1 | 27.5 | 22.8 | 1.2 | 8.8 | 32.7 |
| Compositional Tasks | 55 | 51.5 | 39.4 | 26.7 | 17.0 | 10.3 | 9.1 | 23.0 |
| Overall | ||||||||
| Macro Average | 609 | 59.7 | 51.7 | 30.1 | 22.0 | 25.3 | 19.9 | 38.4 |
| Micro Average | 609 | 60.4 | 52.7 | 30.8 | 22.4 | 25.7 | 19.5 | 38.4 |
Table: Breakdown of WebTailBench results across all 11 segments. Success rates (%) are averaged over 3 independent runs. Among computer-use models, Fara-7B achieves the highest overall performance and leads in every segment except Ticketing, where OpenAI computer-use-preview scores higher (49.7 vs. 38.6).
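The two summary rows of the table can be reproduced from the per-segment numbers. The sketch below uses the Fara-7B column and assumes that the macro average is the unweighted mean over the 11 segments and that the micro average weights each segment by its task count; the README does not spell out these definitions, so treat them as an interpretation. Because the inputs are already rounded to one decimal, the recomputed values are approximate.

```python
# (tasks, Fara-7B success rate in %) per segment, copied from the table above.
SEGMENTS = {
    "Shopping": (56, 52.4),
    "Flights": (51, 37.9),
    "Hotels": (52, 53.8),
    "Restaurants": (52, 47.4),
    "Activities": (80, 36.3),
    "Ticketing": (57, 38.6),
    "Real Estate": (48, 23.6),
    "Jobs/Careers": (50, 28.0),
    "Shopping List (2 items)": (51, 49.0),
    "Comparison Shopping": (57, 32.7),
    "Compositional Tasks": (55, 23.0),
}


def macro_average(segments):
    """Unweighted mean of the per-segment success rates."""
    return sum(rate for _, rate in segments.values()) / len(segments)


def micro_average(segments):
    """Mean of the per-segment success rates weighted by task count."""
    total_tasks = sum(tasks for tasks, _ in segments.values())
    return sum(tasks * rate for tasks, rate in segments.values()) / total_tasks


print(f"macro: {macro_average(SEGMENTS):.1f}")  # macro: 38.4
print(f"micro: {micro_average(SEGMENTS):.1f}")  # micro: 38.4
```

Under these assumptions both values agree with the Macro Average and Micro Average rows for Fara-7B.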
Coming Soon:
- Task Verification pipeline for LLM-as-a-judge evaluation
- Official human annotations of WebTailBench (in partnership with BrowserBase)
While WebTailBench measures agents, CUAVerifierBench measures the judges that score those agents. Each row pairs a Fara-7B agent trajectory (instruction, screenshots, web_surfer log, final answer) with one human reviewer's verdict, plus the verdicts produced by the Universal Verifier (MMRubricAgent) and several legacy verifiers. Researchers can therefore compute verifier–human agreement (Cohen's κ, accuracy, F1) on a fixed corpus and iterate on new judge prompts and architectures against a frozen ground-truth set.
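As a rough illustration of the agreement metrics named above, the following sketch computes accuracy, Cohen's κ, and F1 for two parallel lists of verdicts. The label strings "success" and "failure" and the toy data are hypothetical; the actual column names and label values are documented in the dataset README.

```python
from collections import Counter


def agreement_metrics(human, verifier, positive="success"):
    """Return accuracy, Cohen's kappa, and F1 of verifier labels against human labels."""
    if len(human) != len(verifier) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(human)

    # Observed agreement (identical to accuracy when humans are the reference).
    accuracy = sum(h == v for h, v in zip(human, verifier)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    h_counts, v_counts = Counter(human), Counter(verifier)
    labels = set(h_counts) | set(v_counts)
    expected = sum(h_counts[label] * v_counts[label] for label in labels) / n**2

    # Kappa is undefined when chance agreement is already perfect.
    kappa = float("nan") if expected == 1 else (accuracy - expected) / (1 - expected)

    tp = sum(h == positive and v == positive for h, v in zip(human, verifier))
    fp = sum(h != positive and v == positive for h, v in zip(human, verifier))
    fn = sum(h == positive and v != positive for h, v in zip(human, verifier))
    f1 = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0

    return {"accuracy": accuracy, "kappa": kappa, "f1": f1}


# Toy verdicts (hypothetical), one entry per trajectory.
human = ["success", "success", "failure", "failure", "success", "failure"]
verifier = ["success", "failure", "failure", "failure", "success", "success"]
metrics = agreement_metrics(human, verifier)
```

For the toy lists, the two raters agree on 4 of 6 trajectories while chance agreement is 0.5, so κ comes out at one third.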
The dataset is exposed as two HuggingFace configs joinable on task_id:
| Config | Granularity | Contents |
|---|---|---|
| trajectories | one row per task | instruction, screenshots, web_surfer log, verifier outputs, task-level human aggregates |
| annotations | one row per (task, judge) | per-reviewer outcome / process labels and free-text justifications |
Two splits ship today:
| Split | Source | Trajectories | Annotation rows |
|---|---|---|---|
| fara7b_om2w_browserbase | Fara-7B trajectories on Online-Mind2Web tasks executed via Browserbase | 106 | 215 (≈2 reviewers/task; UV-blind and UV-informed stages) |
| internal | Heldout aurora-v2 task suite scored with the same WebSurfer + verifier stack | 154 | 154 (1 reviewer/task; UV-blind only) |
Reviewer identities are anonymized as Judge1 … JudgeN using a single shared map across both splits. The build script that produced the dataset (with full schema + provenance) ships alongside the data on HuggingFace at microsoft/CUAVerifierBench; see the dataset README for the full column list.
from datasets import load_dataset
trajs = load_dataset("microsoft/CUAVerifierBench", "trajectories",
                     split="fara7b_om2w_browserbase")
anns = load_dataset("microsoft/CUAVerifierBench", "annotations",
                    split="fara7b_om2w_browserbase")
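Since the two configs are joinable on task_id, one way to combine them is to group the annotation rows by task and attach each group to its trajectory. The sketch below works on any iterable of dict-like rows, so it runs here on toy data; apart from task_id, the column names shown (instruction, judge, outcome) are hypothetical placeholders, and the helper attach_annotations is not part of the dataset tooling.

```python
from collections import defaultdict


def attach_annotations(trajectories, annotations):
    """Return one dict per trajectory with its annotation rows under 'annotations'."""
    by_task = defaultdict(list)
    for ann in annotations:
        by_task[ann["task_id"]].append(ann)
    return [
        {**traj, "annotations": by_task.get(traj["task_id"], [])}
        for traj in trajectories
    ]


# Toy rows standing in for the two configs (all values are made up).
toy_trajs = [
    {"task_id": "t1", "instruction": "find a hotel"},
    {"task_id": "t2", "instruction": "book a flight"},
]
toy_anns = [
    {"task_id": "t1", "judge": "Judge1", "outcome": "success"},
    {"task_id": "t1", "judge": "Judge2", "outcome": "failure"},
    {"task_id": "t2", "judge": "Judge1", "outcome": "success"},
]
joined = attach_annotations(toy_trajs, toy_anns)
```

Iterating over a loaded split yields one dict per row, so passing trajs and anns from the snippet above in place of the toy lists should work the same way, although that has not been checked against the published schema.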
Our evaluation setup leverages:
Note: Fara-7B is an experimental release designed to invite hands-on exploration and feedback from the community. We recommend running it in a sandboxed environment, monitoring its execution, and avoiding sensitive data or high-risk domains.
The following instructions are for Linux systems; see the Windows section below for Windows instructions.
Install the package using pip and set up the environment with Playwright:
# 1. Clone repository
git clone https://github.com/microsoft/fara.git
cd fara
# 2. Setup environment
python3 -m venv .venv
source .venv/bin/activate
pip install -e .[vllm]
playwright install
Note: If you plan on hosting with Azure Foundry only, you can skip the [vllm] extra and run pip install -e . instead.
For Windows, we highly recommend using WSL2 (Windows Subsystem for Linux) to provide a Linux-like environment. However, if you prefer to run natively on Windows, follow these steps:
# 1. Clone repository
git clone https://github.com/microsoft/fara.git
cd fara
# 2. Setup environment
python3 -m venv .venv
.venv\Scripts\activate
pip install -e .
python3 -m playwright install
Recommended: The easiest way to get started is using Azure Foundry hosting, which requires no GPU hardware or model downloads. Alternatively, you can self-host with vLLM if you have GPU resources available.
Deploy Fara-7B on Azure Foundry without