github.com/allenai/olmocr @v0.4.27 sqlite

1,717 symbols 6,915 edges 149 files 853 documented · 50% 7 cross-repo links

README

A toolkit for converting PDFs and other image-based document formats into clean, readable, plain text format.

Try the online demo: https://olmocr.allenai.org/

Features:
- Convert PDF, PNG, and JPEG based documents into clean Markdown
- Support for equations, tables, handwriting, and complex formatting
- Automatically removes headers and footers
- Convert into text with a natural reading order, even in the presence of figures, multi-column layouts, and insets
- Efficient, less than $200 USD per million pages converted
- (Based on a 7B parameter VLM, so it requires a GPU)

News

October 21, 2025 - v0.4.0 - New model release, boosts olmOCR-bench score by ~4 points using synthetic data and introduces RL training.
August 13, 2025 - v0.3.0 - New model release, fixes auto-rotation detection, and hallucinations on blank documents.
July 24, 2025 - v0.2.1 - New model release, scores 3 points higher on olmOCR-Bench, runs significantly faster because it defaults to FP8, and needs far fewer retries per document.
July 23, 2025 - v0.2.0 - New cleaned up trainer code, makes it much simpler to train olmOCR models yourself.
June 17, 2025 - v0.1.75 - Switch from sglang to vllm based inference pipeline, updated docker image to CUDA 12.8.
May 23, 2025 - v0.1.70 - Official docker support and images are now available! See Docker usage
May 19, 2025 - v0.1.68 - olmOCR-Bench launch, scoring 77.4. Launch includes 2 point performance boost in olmOCR pipeline due to bug fixes with prompts.
Mar 17, 2025 - v0.1.60 - Performance improvements due to better temperature selection in sampling.
Feb 25, 2025 - v0.1.58 - Initial public launch and demo.

Benchmark

olmOCR-Bench: We also ship a comprehensive benchmark suite covering over 7,000 test cases across 1,400 documents to help measure performance of OCR systems.

System	ArXiv	Old scans math	Tables	Old scans	Headers & footers	Multi column	Long tiny text	Base	Overall
Mistral OCR API	77.2	67.5	60.6	29.3	93.6	71.3	77.1	99.4	72.0±1.1
Marker 1.10.1	83.8	66.8	72.9	33.5	86.6	80.0	85.7	99.3	76.1±1.1
MinerU 2.5.4*	76.6	54.6	84.9	33.7	96.6	78.2	83.5	93.7	75.2±1.1
DeepSeek-OCR	77.2	73.6	80.2	33.3	96.1	66.4	79.4	99.8	75.7±1.0
Nanonets-OCR2-3B	75.4	46.1	86.8	40.9	32.1	81.9	93.0	99.6	69.5±1.1
PaddleOCR-VL*	85.7	71.0	84.1	37.8	97.0	79.9	85.7	98.5	80.0±1.0
Infinity-Parser 7B*	84.4	83.8	85.0	47.9	88.7	84.2	86.4	99.8	82.5±?
Chandra OCR 0.1.0*	82.2	80.3	88.0	50.4	90.8	81.2	92.3	99.9	83.1±0.9

olmOCR v0.4.0	83.0	82.3	84.9	47.7	96.1	83.7	81.9	99.7	82.4±1.1
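
In the rows checked here, the Overall column matches the unweighted mean of the eight category scores. The sketch below recomputes it for two rows copied from the table. This is an observation from the published numbers, not a description of how olmOCR-Bench computes its score, and the ± intervals are not reproduced. The names `ROWS` and `overall` are illustrative.

```python
# Category scores copied from the table above, in column order:
# ArXiv, Old scans math, Tables, Old scans, Headers & footers,
# Multi column, Long tiny text, Base.
ROWS = {
    "olmOCR v0.4.0": [83.0, 82.3, 84.9, 47.7, 96.1, 83.7, 81.9, 99.7],
    "Mistral OCR API": [77.2, 67.5, 60.6, 29.3, 93.6, 71.3, 77.1, 99.4],
}


def overall(scores):
    """Unweighted mean of the per-category scores."""
    return sum(scores) / len(scores)


for name, scores in ROWS.items():
    print(f"{name}: {overall(scores):.1f}")
```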

Installation

System Dependencies

You will need to install poppler-utils and additional fonts for rendering PDF images.

Install dependencies (Ubuntu/Debian):

sudo apt-get update
sudo apt-get install poppler-utils ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
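
To confirm that the poppler install succeeded, a small check like the following may help. It is a minimal sketch: `pdftoppm` and `pdfinfo` are two of the command-line tools that poppler-utils provides, but which of them olmOCR actually invokes is not stated here, and the helper name `missing_tools` is hypothetical.

```python
import shutil

# Two command-line tools shipped by poppler-utils. Which ones olmOCR
# calls is an assumption, not something this README states.
POPPLER_TOOLS = ("pdftoppm", "pdfinfo")


def missing_tools(tools=POPPLER_TOOLS):
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("poppler-utils tools found on PATH")
```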

Python Installation

Set up a conda environment and install olmocr. The requirements for running olmOCR are difficult to install in an existing Python environment, so create a clean environment to install into.

conda create -n olmocr python=3.11
conda activate olmocr

Choose the installation option that matches your use case:

Option 1: Remote Inference (Lightweight)

If you plan to use a remote vLLM server with the --server flag, install the base package:

pip install olmocr

This avoids installing heavy GPU dependencies like PyTorch (~2GB+).

Option 2: Local GPU Inference

Requirements:
- Recent NVIDIA GPU (tested on RTX 4090, L40S, A100, H100) with at least 12 GB of GPU RAM
- 30GB of free disk space
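
A quick way to compare a machine against these two numbers is sketched below. It assumes `nvidia-smi` is on PATH for the GPU query and returns an empty list when it is not. The helper names are hypothetical, and the check is only a rough pre-flight, not something olmOCR runs itself.

```python
import shutil
import subprocess

REQUIRED_DISK_GB = 30  # from the requirements above
REQUIRED_GPU_GB = 12   # from the requirements above


def free_disk_gb(path="."):
    """Free space on the filesystem holding `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def parse_gpu_memory_mib(nvidia_smi_output):
    """Parse one total-memory value (MiB) per line, one line per GPU."""
    values = (line.strip() for line in nvidia_smi_output.splitlines())
    return [int(value) for value in values if value.isdigit()]


def gpu_memory_mib():
    """Query total memory per GPU via nvidia-smi; empty list if unavailable."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpu_memory_mib(out)


if __name__ == "__main__":
    print(f"free disk: {free_disk_gb():.1f} GB (need {REQUIRED_DISK_GB})")
    for index, mib in enumerate(gpu_memory_mib()):
        print(f"GPU {index}: {mib / 1024:.1f} GiB (need {REQUIRED_GPU_GB} GB)")
```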

For running inference with your own GPU:

pip install olmocr[gpu] --extra-index-url https://download.pytorch.org/whl/cu128

# Recommended: install FlashInfer for faster inference on GPU
pip install https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.5%2Bcu128torch2.7-cp38-abi3-linux_x86_64.whl

Option 3: Beaker Cluster Execution

For submitting jobs to Beaker clusters with the --beaker flag:

pip install olmocr[beaker]

Option 4: Benchmark Suite

For running the olmOCR benchmark suite:

pip install olmocr[bench]

Combined Installation

You can combine multiple options:

# GPU + Beaker support
pip install olmocr[gpu,beaker] --extra-index-url https://download.pytorch.org/whl/cu128

# GPU + Benchmark support
pip install olmocr[gpu,bench] --extra-index-url https://download.pytorch.org/whl/cu128

Troubleshooting

If you run into errors about too many open files, update your ulimit:

ulimit -n 65536
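
If olmOCR is launched from a Python script rather than an interactive shell, the same limit can be raised from inside the process with the standard `resource` module (Unix only). This is a sketch under that assumption. The function name `raise_open_file_limit` is hypothetical, and the new limit applies only to the current process and the children it starts.

```python
import resource


def raise_open_file_limit(target=65536):
    """Raise the soft RLIMIT_NOFILE toward `target`, capped by the hard limit.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # already high enough
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        return soft  # the OS refused the new value; keep the old limit
    return target


if __name__ == "__main__":
    print("open-file soft limit:", raise_open_file_limit())
```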

Usage Examples

For quick testing, try the web demo.

Convert a Single PDF (Local GPU):

# Download a sample PDF
curl -o olmocr-sample.pdf https://olmocr.allenai.org/papers/olmocr_3pg_sample.pdf

# Convert it to markdown
olmocr ./localworkspace --markdown --pdfs olmocr-sample.pdf

Convert an Image file:

olmocr ./localworkspace --markdown --pdfs random_page.png

Convert Multiple PDFs:

olmocr ./localworkspace --markdown --pdfs tests/gnarly_pdfs/*.pdf

Use Remote Inference Server:

olmocr ./localworkspace --server http://remote-server:8000/v1 --model allenai/olmOCR-2-7B-1025-FP8 --markdown --pdfs *.pdf

With the --markdown flag, results will be stored as markdown files inside of ./localworkspace/markdown/.
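
To list what a run produced, something like the following can be used. It is a minimal sketch: the only layout fact taken from this README is the `markdown/` subfolder of the workspace. Searching recursively is an assumption, made in case the files are nested, and `markdown_outputs` is a hypothetical helper name.

```python
from pathlib import Path


def markdown_outputs(workspace):
    """Return all .md files under <workspace>/markdown, sorted by path."""
    markdown_dir = Path(workspace) / "markdown"
    if not markdown_dir.is_dir():
        return []
    return sorted(markdown_dir.rglob("*.md"))


if __name__ == "__main__":
    for path in markdown_outputs("./localworkspace"):
        print(path, path.stat().st_size, "bytes")
```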

Note: You can also use python -m olmocr.pipeline instead of olmocr if you prefer.
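
When driving conversions from another Python program, the module form can be assembled as an argument list and passed to `subprocess`. The sketch below uses only flags that appear in this README (`--markdown`, `--pdfs`, `--server`, `--model`). The helper name `build_olmocr_command` is hypothetical, and the actual run is left commented out because it requires olmocr to be installed.

```python
import subprocess
import sys


def build_olmocr_command(workspace, pdfs, markdown=True, server=None, model=None):
    """Assemble the argument list for `python -m olmocr.pipeline`."""
    cmd = [sys.executable, "-m", "olmocr.pipeline", str(workspace)]
    if server:
        cmd += ["--server", server]
    if model:
        cmd += ["--model", model]
    if markdown:
        cmd.append("--markdown")
    # The shell examples pass several paths after a single --pdfs flag.
    cmd += ["--pdfs", *map(str, pdfs)]
    return cmd


if __name__ == "__main__":
    cmd = build_olmocr_command("./localworkspace", ["olmocr-sample.pdf"])
    print(" ".join(cmd))
    # Uncomment to actually run the conversion (requires olmocr installed):
    # subprocess.run(cmd, check=True)
```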

Viewing Results

The ./localworkspace/ workspace folder will then have both Dolma and markdown files (if using --markdown).

cat localworkspace/markdown/olmocr-sample.md

olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models
...
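
The Dolma files can be inspected in a similar way. The sketch below assumes they are JSON Lines files in which each record carries a `text` field, as is usual for Dolma-format documents. Where those files sit inside the workspace is not stated in this section, so the path is left to the caller, and both helper names are hypothetical.

```python
import json


def iter_dolma_documents(jsonl_path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def preview(jsonl_path, width=80):
    """Return the first `width` characters of each document's `text` field."""
    return [doc.get("text", "")[:width] for doc in iter_dolma_documents(jsonl_path)]
```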

Using an Inference Provider or External Server

If you have a vLLM server already running elsewhere (or any inference platform implementing the OpenAI API), you can point olmOCR to use it instead of spawning a local instance.

Installation for Remote Inference:

# Lightweight installation - no GPU dependencies needed
pip install olmocr

Using an External Server:

# Use external vLLM server instead of local one
olmocr ./localworkspace --server http://remote-server:8000/v1 --model allenai/olmOCR-2-7B-1025-FP8 --markdown --pdfs tests/gnarly_pdfs/*.pdf

The served model name in vLLM needs to match the value provided in --model.

Example vLLM Server Launch:

vllm serve allenai/olmOCR-2-7B-1025-FP8 --max-model-len 16384
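
Before starting a long run, it can be worth checking that the name passed to `--model` is one the server reports. OpenAI-compatible servers list their models at `GET <base_url>/models`, and the sketch below parses that response. The helper names are hypothetical, and `http://remote-server:8000/v1` in the comment is the placeholder host used elsewhere in this README.

```python
import json
import urllib.request


def served_model_ids(models_response):
    """Extract model ids from an OpenAI-style `GET /models` response body."""
    return [entry["id"] for entry in models_response.get("data", [])]


def fetch_served_model_ids(base_url, api_key=None, timeout=10):
    """Query <base_url>/models, e.g. base_url='http://remote-server:8000/v1'."""
    request = urllib.request.Request(base_url.rstrip("/") + "/models")
    if api_key:
        request.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return served_model_ids(json.load(response))


# Offline illustration with a hand-written response of the expected shape.
sample = {
    "object": "list",
    "data": [{"id": "allenai/olmOCR-2-7B-1025-FP8", "object": "model"}],
}
print("allenai/olmOCR-2-7B-1025-FP8" in served_model_ids(sample))
```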

Verified External Providers

We have tested olmOCR-2-7B-1025-FP8 on these external model providers and confirmed that they work.

Provider	$/1M Input tokens	$/1M Output tokens	Example Command
Cirrascale	$0.07	$0.15	`olmocr ./workspace --server https://ai2endpoints.cirrascale.ai/api --api_key sk-XXXXXXX --workers 1 --max_concurrent_requests 20 --model olmOCR-2-7B-1025 --pdfs tests/gnarly_pdfs/*.pdf`
DeepInfra	$0.09	$0.19	`olmocr ./workspace --server https://api.deepinfra.com/v1/openai --api_key DfXXXXXXX --workers 1 --max_concurrent_requests 20 --model allenai/olmOCR-2-7B-1025 --pdfs tests/gnarly_pdfs/*.pdf`
Parasail	$0.10	$0.20	`olmocr ./workspace --server https://api.parasail.io/v1 --api_key psk-XXXXX --workers 1 --max_concurrent_requests 20 --model allenai/olmOCR-2-7B-1025 --pdfs tests/gnarly_pdfs/*.pdf`
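
The per-token prices in the table above turn into a job cost once token counts are known. The sketch below shows the arithmetic only. The prices are copied from the table, but the tokens-per-page figures are placeholders chosen for illustration and are not from this README, so measure real usage before budgeting.

```python
# (input, output) prices in USD per 1M tokens, copied from the table above.
PRICES = {
    "Cirrascale": (0.07, 0.15),
    "DeepInfra": (0.09, 0.19),
    "Parasail": (0.10, 0.20),
}


def cost_usd(provider, input_tokens, output_tokens):
    """Total cost in USD for the given token counts at a provider's prices."""
    in_price, out_price = PRICES[provider]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical workload: token counts per page are NOT from the source.
pages = 1_000_000
input_tokens_per_page = 1_500   # assumption for illustration only
output_tokens_per_page = 500    # assumption for illustration only

for provider in PRICES:
    total = cost_usd(provider, pages * input_tokens_per_page,
                     pages * output_tokens_per_page)
    print(f"{provider}: ${total:,.2f} per {pages:,} pages")
```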

Notes on arguments:
- --server: The OpenAI-compatible endpoint, e.g. https://api.deepinfra.com/v1/openai
- --api_key: Your API key, passed in via the Authorization Bearer HTTP header
- --max_concurrent_requests: Max concurrent requests that will be in-flight to the inference provider at one time
- --workers: Max number of page groups that will be processed at once. You may want to set this to 1 so that one group finishes before the next one starts.
- --pages_per_group: You may want a smaller number of pages per group, as many external providers have lower concurrent request limits
- --model: The model identifier, e.g. allenai/olmOCR-2-7B-1025. Different providers use different names; if you run locally, you can use olmocr
- Other arguments work the same as with local inference
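
As a generic illustration of what a cap on in-flight requests means, the sketch below bounds concurrency with an `asyncio.Semaphore`. It is not olmOCR's implementation. `run_bounded` and `fake_request` are hypothetical names, and the "request" is a short sleep rather than a network call.

```python
import asyncio


async def run_bounded(jobs, max_concurrent_requests):
    """Run coroutine factories with at most N in flight; return (results, peak)."""
    semaphore = asyncio.Semaphore(max_concurrent_requests)
    in_flight = 0
    peak = 0

    async def guarded(job):
        nonlocal in_flight, peak
        async with semaphore:  # waits here while N jobs are already running
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await job()
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(job) for job in jobs))
    return results, peak


async def fake_request(page):
    """Stand-in for one inference request (hypothetical, no network)."""
    await asyncio.sleep(0.01)
    return f"page {page} done"


if __name__ == "__main__":
    jobs = [lambda p=p: fake_request(p) for p in range(50)]
    results, peak = asyncio.run(run_bounded(jobs, max_concurrent_requests=20))
    print(len(results), "requests finished; peak in flight:", peak)
```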

Multi-node / Cluster Usage

If you want to convert millions of PDFs using multiple nodes running in parallel, olmOCR supports reading PDFs from AWS S3 and coordinating work using an AWS S3 output bucket.

Start the first worker node:

olmocr s3://my_s3_bucket/pdfworkspaces/exampleworkspace --pdfs s3://my_s3_bucket/jakep/gnarly_pdfs/*.pdf

This sets up a simple work queue in your AWS bucket and starts converting PDFs.

On subsequent worker nodes:

olmocr s3://my_s3_bucket/pdfworkspaces/exampleworkspace

They will automatically start grabbing items from the same workspace queue.
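
The idea of several workers pulling from one shared queue can be illustrated on a local filesystem with atomic lock files. This sketch is only an analogy: it does not use S3, it is not how olmOCR's work queue is implemented, and `try_claim` and `next_item` are hypothetical names.

```python
import os
from pathlib import Path


def try_claim(queue_dir, item_id):
    """Atomically create a lock file for `item_id`; True if this caller won."""
    lock_dir = Path(queue_dir) / "locks"
    lock_dir.mkdir(parents=True, exist_ok=True)
    try:
        # O_CREAT | O_EXCL fails if the file exists, so only one worker succeeds.
        fd = os.open(lock_dir / f"{item_id}.lock",
                     os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def next_item(queue_dir, item_ids):
    """Return the first item this worker manages to claim, or None."""
    for item_id in item_ids:
        if try_claim(queue_dir, item_id):
            return item_id
    return None
```

Each worker would call `next_item` in a loop with the same `queue_dir` and item list, and stop when it returns `None`.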

Using Beaker for Cluster Execution

If you are at Ai2 and want to linearize millions of PDFs efficiently using beaker,

Core symbols most depended-on inside this repo

get

called by 140

scripts/eval/dolma_refine/registry.py

items

called by 89

scripts/eval/dolma_refine/registry.py

error

called by 88

olmocr/train/dataloader.py

render_equation

called by 81

olmocr/bench/katex/render.py

scripts/eval/dolma_refine/registry.py

render_pdf_to_base64png

called by 41

olmocr/data/renderpdf.py

compare_rendered_equations

called by 40

olmocr/bench/katex/render.py

Shape

Function 798

Method 702

Class 197

Route 20

Languages

Python 82%

TypeScript 18%

Modules by API surface

olmocr/bench/katex/katex.min.js 302 symbols

tests/test_tests.py 104 symbols

olmocr/work_queue.py 57 symbols

olmocr/train/dataloader.py 56 symbols

scripts/pii/pii_rule_comparison.py 53 symbols

tests/test_dataloader.py 52 symbols

tests/test_mine_html_templates.py 47 symbols

tests/test_katex_render.py 37 symbols

tests/test_grpo.py 36 symbols

olmocr/bench/tests.py 31 symbols

olmocr/train/config.py 29 symbols

tests/test_anchor.py 27 symbols

Dependencies from manifests, versioned

Pillow 1×

bleach 1×

boto3 1×

cached-path 1×

cryptography 1×

filelock 1×

ftfy 1×

httpx 1×

lingua-language-detector 1×

markdown2 1×

markdownify 1×

orjson 1×

For agents

$ claude mcp add olmocr \
  -- python -m otcore.mcp_server <graph>
