github.com/EvolvingLMMs-Lab/lmms-eval @v0.7.2 sqlite

9,613 symbols 28,125 edges 906 files 3,331 documented · 35%

README

LMMs-Eval: Probing Intelligence in the Real World

We are building a unified evaluation toolkit for frontier models that probes their abilities in the real world and shapes what we build next.

🌐 Available in 17 languages

📚 Documentation | 📖 100+ Tasks | 🌟 30+ Models | ⚡ Quickstart

🏠 Homepage | 💬 Discord | 🤝 Contributing

Why `lmms-eval`?

Benchmarks decide what gets built next. A model team that trusts its eval numbers can focus on real improvements instead of chasing noise. But the multimodal evaluation ecosystem is fragmented - scattered datasets, inconsistent post-processing, and single-number accuracy scores that hide whether a gain is real or random. Two teams evaluating the same model on the same benchmark routinely report different results.

We believe better evals lead to better models. Good evaluation maps the border of what models can do and shapes what we build next.

We are building lmms-eval and focusing on three core principles:

Reproducible - One pipeline, deterministic results. Same model, same benchmark, same numbers, every time.
Efficient - Evaluation should not be the bottleneck, even at large scale. Async serving, adaptive batching, and video I/O optimizations keep your GPUs saturated end to end.
Trustworthy - Not just accuracy. Confidence intervals, clustered standard errors, paired comparisons, and ongoing research into evaluation methodology. Results you can trust enough to act on.

For how the pipeline works and the concrete mechanisms behind these principles, see How the Evaluation Pipeline Works and Why it's Efficient and Trustworthy.
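
To make the "paired comparisons" idea concrete, here is a minimal sketch in plain Python. It is not the lmms-eval implementation: the helper name `paired_diff_ci` and the normal-approximation interval are assumptions chosen for illustration. It compares two models on the same samples and reports the mean per-sample difference with an approximate 95% confidence interval.

```python
import math
import statistics


def paired_diff_ci(scores_a, scores_b, z=1.96):
    """Mean per-sample difference (a - b) with a normal-approximation CI.

    Hypothetical helper for illustration; not part of lmms-eval.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired comparison needs the same samples for both models")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean = statistics.mean(diffs)
    # Standard error of the mean difference (sample standard deviation / sqrt(n))
    sem = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean, (mean - z * sem, mean + z * sem)


# Per-sample correctness (1 = correct) for two models on the same 8 samples
model_a = [1, 1, 0, 1, 1, 0, 1, 1]
model_b = [1, 0, 0, 1, 0, 0, 1, 1]
mean_diff, (low, high) = paired_diff_ci(model_a, model_b)
print(f"mean diff = {mean_diff:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

If the interval contains zero, as it does for this toy data, the observed gain could plausibly be noise, which is the situation a single accuracy number hides.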

What's New

v0.7 (Feb 2026) - Operational simplicity and pipeline maturity. 25+ new tasks across 8 domains, 2 new model backends, agentic task evaluation (generate_until_agentic), video I/O overhaul with TorchCodec (up to 3.58x faster), Lance-backed video distribution on Hugging Face, safety/red-teaming baselines, efficiency metrics (per-sample token counts, run-level throughput), and streamlined flattened JSONL log output for cleaner post-analysis. Release notes | Changelog.

v0.6 (Feb 2026) - Evaluation as a service. Standalone HTTP eval server, ~7.5x throughput over v0.5, statistically grounded results (CI, paired t-test), 50+ new tasks. Release notes | Changelog.

v0.5 (Oct 2025) - Audio expansion. Comprehensive audio evaluation, response caching, 50+ benchmark variants across audio, vision, and reasoning. Release notes.

Older updates

[2025-01] Video-MMMU - Knowledge acquisition from multi-discipline professional videos.
[2024-12] MME-Survey - Comprehensive survey on evaluation of multimodal LLMs.
[2024-11] v0.3 - Audio evaluation support (Qwen2-Audio, Gemini-Audio). Release notes.
[2024-06] v0.2 - Video evaluation (LLaVA-NeXT Video, Gemini 1.5 Pro, VideoMME, EgoSchema). Blog.
[2024-03] v0.1 - First release. Blog.

Quickstart

Install and run your first evaluation in under 5 minutes:

git clone https://github.com/EvolvingLMMs-Lab/lmms-eval.git
cd lmms-eval && uv pip install -e ".[all]"

# Run a quick evaluation (Qwen2.5-VL on MME, 8 samples)
python -m lmms_eval \
  --model qwen2_5_vl \
  --model_args pretrained=Qwen/Qwen2.5-VL-3B-Instruct \
  --tasks mme \
  --batch_size 1 \
  --limit 8

If it prints metrics, your environment is ready. For the full guide, see docs/getting-started/quickstart.md.
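
If you prefer to launch the same run from a Python script, the following sketch assembles the command shown above. The helper `build_eval_command` is hypothetical, and joining several task names with commas is an assumption; only the flags that appear in the quickstart command are used.

```python
import sys


def build_eval_command(model, pretrained, tasks, batch_size=1, limit=None):
    """Assemble the argv list for `python -m lmms_eval` (hypothetical helper)."""
    cmd = [
        sys.executable, "-m", "lmms_eval",
        "--model", model,
        "--model_args", f"pretrained={pretrained}",
        # Assumption: multiple tasks are passed as one comma-separated value
        "--tasks", ",".join(tasks),
        "--batch_size", str(batch_size),
    ]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    return cmd


cmd = build_eval_command("qwen2_5_vl", "Qwen/Qwen2.5-VL-3B-Instruct", ["mme"], limit=8)
print(" ".join(cmd[1:]))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```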

Installation

Using `uv` (Recommended for consistent environments)

We use uv for package management to ensure all developers use exactly the same package versions. First, install uv:

curl -LsSf https://astral.sh/uv/install.sh | sh

For development with a consistent environment:

git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
cd lmms-eval
# Recommended
uv pip install -e ".[all]"
# If you want to use uv sync
# uv sync  # This creates/updates your environment from uv.lock

To run commands:

uv run python -m lmms_eval --help  # Run any command with uv run

To add new dependencies:

uv add <package>  # Updates both pyproject.toml and uv.lock

Alternative Installation

For direct usage from Git:

# Create a virtual environment named "eval" with Python 3.12
uv venv eval --python 3.12
source eval/bin/activate
# You might need to add and include your own task yaml if using this installation
uv pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git

Reproduction of LLaVA-1.5's paper results

To reproduce LLaVA-1.5's paper results, consult the torch environment info and the results check. We found that differences in torch/CUDA versions cause small variations in the results.

If you want to test on caption datasets such as coco, refcoco, and nocaps, you need java==1.8.0 for the pycocoeval API to work. If you don't have it, you can install it with conda:

conda install openjdk=8

You can then check your Java version with java -version.

Comprehensive Evaluation Results of LLaVA Family Models

With the extensive results table below, we aim to give readers detailed information about the datasets included in lmms-eval and some specifics of each. We welcome any corrections to our evaluation process.

We provide a Google Sheet for the detailed results of the LLaVA series models on different datasets. You can access the sheet here. It's a live sheet, and we are updating it with new results.

We also provide the raw data exported from Weights & Biases for the detailed results of the LLaVA series models on different datasets. You can access the raw data here.

If you want to test VILA, you should install the following dependencies:

pip install s2wrapper@git+https://github.com/bfshi/scaling_on_scales

Development continues on the main branch. We encourage you to tell us which features you want and how to improve the library, or to ask questions, in issues or PRs on GitHub.

Usage Examples

More examples can be found in examples/models

Evaluation with vLLM

Qwen2.5-VL:

bash examples/models/vllm_qwen2vl.sh

Qwen3-VL:

bash examples/models/vllm_qwen3vl.sh

Qwen3.5:

bash examples/models/vllm_qwen35.sh

Evaluation with SGLang

bash examples/models/sglang.sh

Qwen3.5:

bash examples/models/sglang_qwen35.sh

Evaluation of OpenAI-Compatible Model

bash examples/models/openai_compatible.sh

Evaluation of Qwen2.5-VL

bash examples/models/qwen25vl.sh

Evaluation of Qwen3-VL

bash examples/models/qwen3vl.sh

More Parameters

python3 -m lmms_eval --help

Environmental Variables

Before running experiments and evaluations, we recommend exporting the following environment variables. Some are required for certain tasks to run.

export OPENAI_API_KEY="<YOUR_API_KEY>"
export HF_HOME="<Path to HF cache>"
export HF_TOKEN="<YOUR_API_KEY>"
export HF_HUB_ENABLE_HF_TRANSFER="1"
export REKA_API_KEY="<YOUR_API_KEY>"
# Other possible environment variables include
# ANTHROPIC_API_KEY,DASHSCOPE_API_KEY etc.
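
A small pre-flight check can catch an unset variable before a long run starts. The sketch below is an illustration only: the helper `missing_env_vars` is hypothetical, and which variables a given task actually requires is an assumption you should adjust for your setup.

```python
import os


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


# Assumption: this run needs an OpenAI key and a Hugging Face token
missing = missing_env_vars(["OPENAI_API_KEY", "HF_TOKEN"])
if missing:
    print("Please export: " + ", ".join(missing))
```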

Common Environment Issues

You may occasionally run into common issues, for example errors related to httpx or protobuf. To resolve them, first try:

python3 -m pip install httpx==0.23.3;
python3 -m pip install protobuf==3.20;
# numpy==2.x can sometimes cause errors; pin to 1.26 if so
python3 -m pip install numpy==1.26;
# Sometimes sentencepiece is required for the tokenizer to work
python3 -m pip install sentencepiece;

Custom Model Integration

lmms-eval supports two types of models: Chat (recommended) and Simple (legacy).

Chat Models (Recommended) 🌟

Location: lmms_eval/models/chat/
Use: doc_to_messages function from task
Input: Structured ChatMessages with roles (user, system, assistant) and content types (text, image, video, audio)
Supports: Interleaved multimodal content
Uses: Model's apply_chat_template() method
Reference: lmms_eval/models/chat/qwen2_5_vl.py or lmms_eval/models/chat/qwen3_vl.py

Example input format:

[
    {"role": "user", "content": [
        {"type": "image", "url": <image>},
        {"type": "text", "text": "What's in this image?"}
    ]}
]
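
To illustrate how a structured message list separates into media and text, here is a minimal sketch that walks the format above using plain dictionaries. It is not the library's `ChatMessages` implementation; the function `split_media` is hypothetical, and the string stands in for a real image object.

```python
def split_media(messages):
    """Collect media payloads by type, and text parts in order (hypothetical helper)."""
    media = {"image": [], "video": [], "audio": []}
    texts = []
    for message in messages:
        for part in message["content"]:
            if part["type"] == "text":
                texts.append(part["text"])
            elif part["type"] in media:
                media[part["type"]].append(part["url"])
            else:
                raise ValueError(f"unknown content type: {part['type']!r}")
    return media, texts


messages = [
    {"role": "user", "content": [
        {"type": "image", "url": "placeholder-image"},  # stand-in for a PIL image
        {"type": "text", "text": "What's in this image?"},
    ]}
]
media, texts = split_media(messages)
print(len(media["image"]), texts)
```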

Simple Models (Legacy)

Location: lmms_eval/models/simple/
Use: doc_to_visual + doc_to_text functions from task
Input: Plain text with <image> placeholders + separate visual list
Supports: Limited (mainly images)
Manual processing: No chat template support
Reference: lmms_eval/models/simple/instructblip.py

Example input format:

# Separate visual and text
doc_to_visual -> [PIL.Image]
doc_to_text -> "What's in this image?"
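
The sketch below shows one way the two simple-model outputs could be combined into a single prompt. It rests on assumptions: the helper `build_simple_prompt` is hypothetical, and whether a model wrapper inserts missing `<image>` placeholders itself varies by model, so treat this as an illustration of the format rather than of any specific wrapper.

```python
def build_simple_prompt(text, visuals, placeholder="<image>"):
    """Prepend one placeholder per visual not already referenced in the text."""
    missing = len(visuals) - text.count(placeholder)
    if missing < 0:
        raise ValueError("text has more placeholders than visuals")
    return (placeholder + "\n") * missing + text


visuals = ["placeholder-image"]  # stand-in for [PIL.Image]
prompt = build_simple_prompt("What's in this image?", visuals)
print(prompt)
```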

Key Differences

Aspect	Chat Models	Simple Models
File location	`models/chat/`	`models/simple/`
Input method	`doc_to_messages`	`doc_to_visual` + `doc_to_text`
Message format	Structured (roles + content types)	Plain text with placeholders
Interleaved support	✅ Yes	❌ Limited
Chat template	✅ Built-in	❌ Manual/None
Recommendation	Use this	Legacy only

Why Use Chat Models?

✅ Built-in chat template support
✅ Interleaved multimodal content
✅ Structured message protocol
✅ Better video/audio support
✅ Consistent with modern LLM APIs

Chat Model Implementation Example

from lmms_eval.api.registry import register_model
from lmms_eval.api.model import lmms
from lmms_eval.protocol import ChatMessages

@register_model("my_chat_model")
class MyChatModel(lmms):
    is_simple = False  # Use chat interface

    def generate_until(self, requests):
        for request in requests:
            # 5 elements for chat models
            doc_to_messages, gen_kwargs, doc_id, task, split = request.args

            # Get structured messages
            raw_messages = doc_to_messages(self.task_dict[task][split][doc_id])
            messages = ChatMessages(messages=raw_messages)

            # Extract media and apply chat template
            images, videos, audios = messages.extract_media()
            hf_messages = messages.to_hf_messages()
            text = self.processor.apply_chat_template(hf_messages)

            # Generate...

For more details, see the Model Guide.

Custom Dataset Integration

Task Configuration with `doc_to_messages`

Implement doc_to_messages to transform dataset documents into structured chat messages:

def my_doc_to_messages(doc, lmms_eval_specific_kwargs=None):
    # Extract visuals and text from doc
    visuals = my_doc_to_visual(doc)
    text = my_doc_to_text(doc, lmms_eval_specific_kwargs)

    # Build structured messages
    messages = [{"role": "user", "content": []}]

    # Add visuals first
    for visual in visuals:
        messages[0]["content"].append({"type": "image", "url": visual})

    # Add text
    messages[0]["content"].append({"type": "text", "text": text})

    return messages
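
To see what this produces, the following self-contained sketch runs a compact version of the function on a toy document. The helpers, the document fields (`image`, `question`), and the `pre_prompt`/`post_prompt` keys are assumptions made for illustration; real tasks define their own.

```python
def my_doc_to_visual(doc):
    # Hypothetical helper: return the document's visuals as a list
    return [doc["image"]]


def my_doc_to_text(doc, lmms_eval_specific_kwargs=None):
    # Hypothetical helper: wrap the question with optional prompt affixes
    kwargs = lmms_eval_specific_kwargs or {}
    return f"{kwargs.get('pre_prompt', '')}{doc['question']}{kwargs.get('post_prompt', '')}"


def my_doc_to_messages(doc, lmms_eval_specific_kwargs=None):
    # Visuals first, then the text, all inside a single user turn
    content = [{"type": "image", "url": v} for v in my_doc_to_visual(doc)]
    content.append({"type": "text", "text": my_doc_to_text(doc, lmms_eval_specific_kwargs)})
    return [{"role": "user", "content": content}]


doc = {"image": "placeholder-image", "question": "What's in this image?"}
messages = my_doc_to_messages(doc, {"post_prompt": "\nAnswer briefly."})
print(messages[0]["content"][-1]["text"])
```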

YAM

Extension points exported contracts — how you extend this code

ShellEditorProps (Interface)

(no doc)

lmms_eval/tui/web/src/App.tsx

SelectProps (Interface)

(no doc)

lmms_eval/tui/web/src/App.tsx

ModelInfo (Interface)

(no doc)

lmms_eval/tui/web/src/App.tsx

TaskInfo (Interface)

(no doc)

lmms_eval/tui/web/src/App.tsx

YamlPreview (Interface)

(no doc)

lmms_eval/tui/web/src/App.tsx

Core symbols most depended-on inside this repo

group

called by 360

lmms_eval/api/group.py

update

called by 266

lmms_eval/models/model_utils/progress.py

called by 134

lmms_eval/models/model_utils/progress.py

generate

called by 96

lmms_eval/models/simple/srt_api.py

generate_submission_file

called by 80

lmms_eval/tasks/_task_utils/file_utils.py

add_partial

called by 72

lmms_eval/api/model.py

load

called by 72

lmms_eval/tasks/mmsearch/retrieve_content/tokenization/utils.py

get_batched

called by 70

lmms_eval/utils.py

Shape

Function 5,222

Method 3,686

Class 649

Route 42

Interface 14

Languages

Python 99%

TypeScript 1%

Modules by API surface

lmms_eval/tasks/voicebench/instruction_following_eval/instructions.py · 153 symbols

lmms_eval/tasks/ifeval/instructions.py · 153 symbols

lmms_eval/tasks/vbvr/vbvr_bench/evaluators/In_Domain_50_part5.py · 91 symbols

lmms_eval/tasks/vbvr/vbvr_bench/evaluators/Out_of_Domain_50_part1.py · 89 symbols

lmms_eval/tasks/vbvr/vbvr_bench/evaluators/Out_of_Domain_50_part4.py · 84 symbols

lmms_eval/tasks/vbvr/vbvr_bench/evaluators/Out_of_Domain_50_part3.py · 83 symbols

lmms_eval/api/task.py · 82 symbols

lmms_eval/tasks/vbvr/vbvr_bench/evaluators/In_Domain_50_part2.py · 79 symbols

lmms_eval/utils.py · 76 symbols

lmms_eval/api/metrics.py · 72 symbols

lmms_eval/tasks/vbvr/vbvr_bench/evaluators/Out_of_Domain_50_part5.py · 70 symbols

lmms_eval/tasks/vbvr/vbvr_bench/evaluators/Out_of_Domain_50_part2.py · 69 symbols

Dependencies from manifests, versioned

@types/react 18.3.3 · 1×

@types/react-dom 18.3.0 · 1×

@vitejs/plugin-react 4.3.1 · 1×

autoprefixer 10.4.19 · 1×

postcss 8.4.38 · 1×

react 18.3.1 · 1×

react-dom 18.3.1 · 1×

tailwindcss 3.4.4 · 1×

typescript 5.5.2 · 1×

vite 5.3.1 · 1×

Jinja2 · 1×

Requests 2.32.3 · 1×

For agents

$ claude mcp add lmms-eval \
  -- python -m otcore.mcp_server <graph>
