
github.com/allenai/olmocr @v0.4.27 sqlite

1,717 symbols · 6,915 edges · 149 files · 853 documented (50%) · 7 cross-repo links
README


A toolkit for converting PDFs and other image-based document formats into clean, readable, plain text format.

Try the online demo: https://olmocr.allenai.org/

Features:

  • Convert PDF, PNG, and JPEG based documents into clean Markdown
  • Support for equations, tables, handwriting, and complex formatting
  • Automatically removes headers and footers
  • Convert into text with a natural reading order, even in the presence of figures, multi-column layouts, and insets
  • Efficient, less than $200 USD per million pages converted
  • (Based on a 7B parameter VLM, so it requires a GPU)

News

  • October 21, 2025 - v0.4.0 - New model release, boosts olmOCR-bench score by ~4 points using synthetic data and introduces RL training.
  • August 13, 2025 - v0.3.0 - New model release, fixes auto-rotation detection, and hallucinations on blank documents.
  • July 24, 2025 - v0.2.1 - New model release, scores 3 points higher on olmOCR-Bench, also runs significantly faster because it defaults to FP8, and needs far fewer retries per document.
  • July 23, 2025 - v0.2.0 - New cleaned up trainer code, makes it much simpler to train olmOCR models yourself.
  • June 17, 2025 - v0.1.75 - Switch from sglang to vllm based inference pipeline, updated docker image to CUDA 12.8.
  • May 23, 2025 - v0.1.70 - Official docker support and images are now available! See Docker usage
  • May 19, 2025 - v0.1.68 - olmOCR-Bench launch, scoring 77.4. Launch includes 2 point performance boost in olmOCR pipeline due to bug fixes with prompts.
  • Mar 17, 2025 - v0.1.60 - Performance improvements due to better temperature selection in sampling.
  • Feb 25, 2025 - v0.1.58 - Initial public launch and demo.

Benchmark

olmOCR-Bench: We also ship a comprehensive benchmark suite covering over 7,000 test cases across 1,400 documents to help measure performance of OCR systems.

| Model | ArXiv | Old scans math | Tables | Old scans | Headers & footers | Multi column | Long tiny text | Base | Overall |
|---|---|---|---|---|---|---|---|---|---|
| Mistral OCR API | 77.2 | 67.5 | 60.6 | 29.3 | 93.6 | 71.3 | 77.1 | 99.4 | 72.0±1.1 |
| Marker 1.10.1 | 83.8 | 66.8 | 72.9 | 33.5 | 86.6 | 80.0 | 85.7 | 99.3 | 76.1±1.1 |
| MinerU 2.5.4* | 76.6 | 54.6 | 84.9 | 33.7 | 96.6 | 78.2 | 83.5 | 93.7 | 75.2±1.1 |
| DeepSeek-OCR | 77.2 | 73.6 | 80.2 | 33.3 | 96.1 | 66.4 | 79.4 | 99.8 | 75.7±1.0 |
| Nanonets-OCR2-3B | 75.4 | 46.1 | 86.8 | 40.9 | 32.1 | 81.9 | 93.0 | 99.6 | 69.5±1.1 |
| PaddleOCR-VL* | 85.7 | 71.0 | 84.1 | 37.8 | 97.0 | 79.9 | 85.7 | 98.5 | 80.0±1.0 |
| Infinity-Parser 7B* | 84.4 | 83.8 | 85.0 | 47.9 | 88.7 | 84.2 | 86.4 | 99.8 | 82.5±? |
| Chandra OCR 0.1.0* | 82.2 | 80.3 | 88.0 | 50.4 | 90.8 | 81.2 | 92.3 | 99.9 | 83.1±0.9 |
| olmOCR v0.4.0 | 83.0 | 82.3 | 84.9 | 47.7 | 96.1 | 83.7 | 81.9 | 99.7 | 82.4±1.1 |
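
The README does not say how the Overall column is computed. For the rows checked here, it matches the unweighted mean of the eight category scores rounded to one decimal. The sketch below reproduces that arithmetic for two rows of the table. The averaging rule is an inference from the published numbers, not a documented definition, and the function name is illustrative.

```python
from statistics import mean

# Category scores copied from the benchmark table, in column order:
# ArXiv, Old scans math, Tables, Old scans, Headers & footers,
# Multi column, Long tiny text, Base.
CATEGORY_SCORES = {
    "Mistral OCR API": [77.2, 67.5, 60.6, 29.3, 93.6, 71.3, 77.1, 99.4],
    "olmOCR v0.4.0": [83.0, 82.3, 84.9, 47.7, 96.1, 83.7, 81.9, 99.7],
}


def overall_score(scores):
    """Unweighted mean of the category scores, rounded to one decimal.

    Assumption: this averaging rule is inferred from the table, not
    taken from the benchmark's documentation.
    """
    return round(mean(scores), 1)


for name, scores in CATEGORY_SCORES.items():
    print(f"{name}: {overall_score(scores)}")
```

Because the mean is unweighted, a weak category such as Old scans pulls the overall score down as much as any other category, regardless of how many test cases it contains.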

Installation

System Dependencies

You will need to install poppler-utils and additional fonts for rendering PDF images.

Install dependencies (Ubuntu/Debian):

sudo apt-get update
sudo apt-get install poppler-utils ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools

Python Installation

Set up a conda environment and install olmocr. The requirements for running olmOCR are difficult to install in an existing python environment, so please do make a clean python environment to install into.

conda create -n olmocr python=3.11
conda activate olmocr

Choose the installation option that matches your use case:

Option 1: Remote Inference (Lightweight)

If you plan to use a remote vLLM server with the --server flag, install the base package:

pip install olmocr

This avoids installing heavy GPU dependencies like PyTorch (~2GB+).

Option 2: Local GPU Inference

Requirements:

  • Recent NVIDIA GPU (tested on RTX 4090, L40S, A100, H100) with at least 12 GB of GPU RAM
  • 30 GB of free disk space

For running inference with your own GPU:

pip install olmocr[gpu] --extra-index-url https://download.pytorch.org/whl/cu128

# Recommended: Install FlashInfer for faster inference on GPU
pip install https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.5%2Bcu128torch2.7-cp38-abi3-linux_x86_64.whl

Option 3: Beaker Cluster Execution

For submitting jobs to Beaker clusters with the --beaker flag:

pip install olmocr[beaker]

Option 4: Benchmark Suite

For running the olmOCR benchmark suite:

pip install olmocr[bench]

Combined Installation

You can combine multiple options:

# GPU + Beaker support
pip install olmocr[gpu,beaker] --extra-index-url https://download.pytorch.org/whl/cu128

# GPU + Benchmark support
pip install olmocr[gpu,bench] --extra-index-url https://download.pytorch.org/whl/cu128

Troubleshooting

If you run into errors about too many open files, update your ulimit:

ulimit -n 65536
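
If you launch olmOCR from a Python script rather than an interactive shell, the same limit can be raised from inside the process with the standard library `resource` module (Unix only). This is a minimal sketch, not part of olmOCR: the function name is illustrative, and the change applies only to the current process and the child processes it starts.

```python
import resource


def raise_open_file_limit(target: int = 65536) -> int:
    """Raise the soft open-file limit toward `target` and return the result.

    The soft limit can never exceed the hard limit, so the value is
    clamped. If the operating system rejects the change, the existing
    soft limit is returned unchanged.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY
    if soft == unlimited or soft >= target:
        return soft  # already high enough, nothing to do
    new_soft = target if hard == unlimited else min(target, hard)
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    except (ValueError, OSError):
        return soft  # the platform refused the new value
    return new_soft


if __name__ == "__main__":
    print("soft open-file limit:", raise_open_file_limit())
```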

Usage Examples

For quick testing, try the web demo.

Convert a Single PDF (Local GPU):

# Download a sample PDF
curl -o olmocr-sample.pdf https://olmocr.allenai.org/papers/olmocr_3pg_sample.pdf

# Convert it to markdown
olmocr ./localworkspace --markdown --pdfs olmocr-sample.pdf

Convert an Image file:

olmocr ./localworkspace --markdown --pdfs random_page.png

Convert Multiple PDFs:

olmocr ./localworkspace --markdown --pdfs tests/gnarly_pdfs/*.pdf

Use Remote Inference Server:

olmocr ./localworkspace --server http://remote-server:8000/v1 --model allenai/olmOCR-2-7B-1025-FP8 --markdown --pdfs *.pdf

With the --markdown flag, results will be stored as markdown files inside of ./localworkspace/markdown/.

Note: You can also use python -m olmocr.pipeline instead of olmocr if you prefer.
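
The commands above can also be assembled from Python, which is convenient when the list of input files is computed by a script. The sketch below builds the argument list for `python -m olmocr.pipeline` using only the flags shown in this section (`--markdown`, `--pdfs`, `--server`, `--model`). The helper name `build_olmocr_command` is hypothetical and is not part of the olmocr package.

```python
import shlex
import sys


def build_olmocr_command(workspace, pdfs, markdown=True, server=None, model=None):
    """Return an argv list equivalent to the CLI examples in this section.

    Hypothetical helper: it only arranges flags that appear in the README.
    """
    cmd = [sys.executable, "-m", "olmocr.pipeline", workspace]
    if server:
        cmd += ["--server", server]
    if model:
        cmd += ["--model", model]
    if markdown:
        cmd.append("--markdown")
    # The shell expands *.pdf before olmocr sees it. From Python, pass the
    # expanded file list yourself (for example via glob.glob).
    cmd += ["--pdfs", *pdfs]
    return cmd


if __name__ == "__main__":
    command = build_olmocr_command("./localworkspace", ["olmocr-sample.pdf"])
    print(shlex.join(command))
    # To actually run the conversion:
    #   import subprocess
    #   subprocess.run(command, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with file names that contain spaces.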

Viewing Results

The ./localworkspace/ workspace folder will then have both Dolma and markdown files (if using --markdown).

cat localworkspace/markdown/olmocr-sample.md 
olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models
...
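
To inspect many results at once, the markdown directory can be walked from Python. This is a small sketch using only the standard library. It assumes the `markdown/` subfolder described above and searches it recursively in case outputs are nested in subdirectories; the function name is illustrative.

```python
from pathlib import Path


def list_markdown_outputs(workspace):
    """Return every .md file under <workspace>/markdown, sorted by path.

    Returns an empty list when the folder does not exist, for example
    when the conversion was run without --markdown.
    """
    markdown_dir = Path(workspace) / "markdown"
    if not markdown_dir.is_dir():
        return []
    return sorted(markdown_dir.rglob("*.md"))


for path in list_markdown_outputs("./localworkspace"):
    # Print each file name with its first line as a quick preview.
    lines = path.read_text(encoding="utf-8").splitlines()
    print(path.name, "->", lines[0] if lines else "(empty)")
```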

Using an Inference Provider or External Server

If you have a vLLM server already running elsewhere (or any inference platform implementing the OpenAI API), you can point olmOCR to use it instead of spawning a local instance.

Installation for Remote Inference:

# Lightweight installation - no GPU dependencies needed
pip install olmocr

Using an External Server:

# Use external vLLM server instead of local one
olmocr ./localworkspace --server http://remote-server:8000/v1 --model allenai/olmOCR-2-7B-1025-FP8 --markdown --pdfs tests/gnarly_pdfs/*.pdf

The served model name in vLLM needs to match the value provided in --model.

Example vLLM Server Launch:

vllm serve allenai/olmOCR-2-7B-1025-FP8 --max-model-len 16384

Verified External Providers

We have tested olmOCR-2-7B-1025-FP8 on these external model providers and confirmed that they work:

| Provider | $/1M Input tokens | $/1M Output tokens | Example Command |
|---|---|---|---|
| Cirrascale | $0.07 | $0.15 | olmocr ./workspace --server https://ai2endpoints.cirrascale.ai/api --api_key sk-XXXXXXX --workers 1 --max_concurrent_requests 20 --model olmOCR-2-7B-1025 --pdfs tests/gnarly_pdfs/*.pdf |
| DeepInfra | $0.09 | $0.19 | olmocr ./workspace --server https://api.deepinfra.com/v1/openai --api_key DfXXXXXXX --workers 1 --max_concurrent_requests 20 --model allenai/olmOCR-2-7B-1025 --pdfs tests/gnarly_pdfs/*.pdf |
| Parasail | $0.10 | $0.20 | olmocr ./workspace --server https://api.parasail.io/v1 --api_key psk-XXXXX --workers 1 --max_concurrent_requests 20 --model allenai/olmOCR-2-7B-1025 --pdfs tests/gnarly_pdfs/*.pdf |

Notes on arguments:

  • --server: The OpenAI-compatible endpoint, e.g. https://api.deepinfra.com/v1/openai
  • --api_key: Your API key, passed in via the Authorization Bearer HTTP header
  • --max_concurrent_requests: Maximum number of requests in flight to the inference provider at one time
  • --workers: Maximum number of page groups processed at once. You may want to set this to 1 so that one group finishes before the next one starts.
  • --pages_per_group: You may want a smaller number of pages per group, as many external providers have lower concurrent request limits
  • --model: The model identifier, e.g. allenai/olmOCR-2-7B-1025. Different providers use different names; if you run locally, you can use olmocr
  • Other arguments work the same as with local inference
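
The per-token prices in the provider table can be turned into a rough budget once you know how many tokens your documents consume. The sketch below shows the arithmetic only. The prices are copied from the table, while the per-page token counts in the example are made-up placeholders, not measurements of olmOCR; substitute figures from your own runs.

```python
# (input price, output price) in USD per 1M tokens, from the provider table.
PRICES_PER_MILLION = {
    "Cirrascale": (0.07, 0.15),
    "DeepInfra": (0.09, 0.19),
    "Parasail": (0.10, 0.20),
}


def estimate_cost_usd(provider: str, input_tokens: int, output_tokens: int) -> float:
    """Cost = input_tokens / 1e6 * input price + output_tokens / 1e6 * output price."""
    price_in, price_out = PRICES_PER_MILLION[provider]
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out


if __name__ == "__main__":
    pages = 1_000_000
    # ASSUMPTION: hypothetical token counts per page, for illustration only.
    input_tokens_per_page = 1_500
    output_tokens_per_page = 500
    for provider in PRICES_PER_MILLION:
        cost = estimate_cost_usd(
            provider,
            pages * input_tokens_per_page,
            pages * output_tokens_per_page,
        )
        print(f"{provider}: ${cost:,.2f} for {pages:,} pages")
```

Retries add to the total, since each retried page is billed again, so treat the result as a lower bound.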

Multi-node / Cluster Usage

If you want to convert millions of PDFs using multiple nodes running in parallel, olmOCR supports reading PDFs from AWS S3 and coordinating work using an AWS S3 output bucket.

Start the first worker node:

olmocr s3://my_s3_bucket/pdfworkspaces/exampleworkspace --pdfs s3://my_s3_bucket/jakep/gnarly_pdfs/*.pdf

This sets up a simple work queue in your AWS bucket and starts converting PDFs.

On subsequent worker nodes:

olmocr s3://my_s3_bucket/pdfworkspaces/exampleworkspace

They will automatically start grabbing items from the same workspace queue.

Using Beaker for Cluster Execution

If you are at Ai2 and want to linearize millions of PDFs efficiently using beaker,

Core symbols most depended-on inside this repo

  • get (called by 140): scripts/eval/dolma_refine/registry.py
  • items (called by 89): scripts/eval/dolma_refine/registry.py
  • error (called by 88): olmocr/train/dataloader.py
  • render_equation (called by 81): olmocr/bench/katex/render.py
  • parse_s3_path (called by 49): olmocr/s3_utils.py
  • add (called by 46): scripts/eval/dolma_refine/registry.py
  • render_pdf_to_base64png (called by 41): olmocr/data/renderpdf.py
  • compare_rendered_equations (called by 40): olmocr/bench/katex/render.py

Shape

Function 798
Method 702
Class 197
Route 20

Languages

Python 82%
TypeScript 18%

Modules by API surface

olmocr/bench/katex/katex.min.js: 302 symbols
tests/test_tests.py: 104 symbols
olmocr/work_queue.py: 57 symbols
olmocr/train/dataloader.py: 56 symbols
scripts/pii/pii_rule_comparison.py: 53 symbols
tests/test_dataloader.py: 52 symbols
tests/test_mine_html_templates.py: 47 symbols
tests/test_katex_render.py: 37 symbols
tests/test_grpo.py: 36 symbols
olmocr/bench/tests.py: 31 symbols
olmocr/train/config.py: 29 symbols
tests/test_anchor.py: 27 symbols

Dependencies from manifests, versioned

bleach
cached-path
filelock
ftfy
httpx
lingua-language-detector
markdown2
markdownify
orjson

For agents

$ claude mcp add olmocr \
  -- python -m otcore.mcp_server <graph>
