A toolkit for converting PDFs and other image-based document formats into clean, readable, plain text format.
Try the online demo: https://olmocr.allenai.org/
Features:
- Convert PDF, PNG, and JPEG based documents into clean Markdown
- Support for equations, tables, handwriting, and complex formatting
- Automatically removes headers and footers
- Convert into text with a natural reading order, even in the presence of figures, multi-column layouts, and insets
- Efficient, less than $200 USD per million pages converted
- (Based on a 7B parameter VLM, so it requires a GPU)
olmOCR-Bench: We also ship a comprehensive benchmark suite covering over 7,000 test cases across 1,400 documents to help measure performance of OCR systems.
| System | ArXiv | Old scans math | Tables | Old scans | Headers & footers | Multi column | Long tiny text | Base | Overall |
|---|---|---|---|---|---|---|---|---|---|
| Mistral OCR API | 77.2 | 67.5 | 60.6 | 29.3 | 93.6 | 71.3 | 77.1 | 99.4 | 72.0±1.1 |
| Marker 1.10.1 | 83.8 | 66.8 | 72.9 | 33.5 | 86.6 | 80.0 | 85.7 | 99.3 | 76.1±1.1 |
| MinerU 2.5.4* | 76.6 | 54.6 | 84.9 | 33.7 | 96.6 | 78.2 | 83.5 | 93.7 | 75.2±1.1 |
| DeepSeek-OCR | 77.2 | 73.6 | 80.2 | 33.3 | 96.1 | 66.4 | 79.4 | 99.8 | 75.7±1.0 |
| Nanonets-OCR2-3B | 75.4 | 46.1 | 86.8 | 40.9 | 32.1 | 81.9 | 93.0 | 99.6 | 69.5±1.1 |
| PaddleOCR-VL* | 85.7 | 71.0 | 84.1 | 37.8 | 97.0 | 79.9 | 85.7 | 98.5 | 80.0±1.0 |
| Infinity-Parser 7B* | 84.4 | 83.8 | 85.0 | 47.9 | 88.7 | 84.2 | 86.4 | 99.8 | 82.5±? |
| Chandra OCR 0.1.0* | 82.2 | 80.3 | 88.0 | 50.4 | 90.8 | 81.2 | 92.3 | 99.9 | 83.1±0.9 |
| olmOCR v0.4.0 | 83.0 | 82.3 | 84.9 | 47.7 | 96.1 | 83.7 | 81.9 | 99.7 | 82.4±1.1 |
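In the rows shown here, the Overall column matches the unweighted mean of the eight category scores. The short check below recomputes it for three rows copied from the table. This is an observation about the numbers above, not a statement of how olmOCR-Bench defines the metric, and the ± intervals are not reproduced.

```python
# Category scores copied from the table above, in column order:
# ArXiv, Old scans math, Tables, Old scans, Headers & footers,
# Multi column, Long tiny text, Base.
scores = {
    "Mistral OCR API": [77.2, 67.5, 60.6, 29.3, 93.6, 71.3, 77.1, 99.4],
    "Chandra OCR 0.1.0*": [82.2, 80.3, 88.0, 50.4, 90.8, 81.2, 92.3, 99.9],
    "olmOCR v0.4.0": [83.0, 82.3, 84.9, 47.7, 96.1, 83.7, 81.9, 99.7],
}


def overall(category_scores):
    """Unweighted mean of the category scores, rounded to one decimal."""
    return round(sum(category_scores) / len(category_scores), 1)


for name, row in scores.items():
    print(f"{name}: {overall(row)}")
```

Because every category counts equally, a weak result in one category (Old scans, for most systems) pulls the overall score down as much as a weak result in any other.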
You will need to install poppler-utils and additional fonts for rendering PDF images.
Install dependencies (Ubuntu/Debian):
sudo apt-get update
sudo apt-get install poppler-utils ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
Set up a conda environment and install olmocr. The requirements for running olmOCR are difficult to install into an existing Python environment, so please create a clean environment to install into.
conda create -n olmocr python=3.11
conda activate olmocr
Choose the installation option that matches your use case:
Option 1: Remote Inference (Lightweight)
If you plan to use a remote vLLM server with the --server flag, install the base package:
pip install olmocr
This avoids installing heavy GPU dependencies like PyTorch (~2GB+).
Option 2: Local GPU Inference
Requirements:
- Recent NVIDIA GPU (tested on RTX 4090, L40S, A100, H100) with at least 12 GB of GPU RAM
- 30GB of free disk space
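Before installing, it can help to check a machine against these requirements. The sketch below is not part of olmOCR: the helper names are made up for illustration, it assumes nvidia-smi is on the PATH, and it reads "12 GB" as 12 × 1024 MiB, which is an assumption about how to interpret the requirement.

```python
import shutil
import subprocess

MIN_GPU_MEM_MIB = 12 * 1024  # assumption: "12 GB" read as 12 * 1024 MiB
MIN_FREE_DISK_GB = 30        # "30GB of free disk space"


def parse_gpu_memory(csv_text):
    """Parse 'name, memory.total' CSV lines into (name, MiB) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, mem = line.rsplit(",", 1)
        gpus.append((name.strip(), int(mem.strip())))
    return gpus


def query_gpus():
    """Ask nvidia-smi for each GPU's name and total memory in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_memory(result.stdout)


def free_disk_gb(path="."):
    """Free space on the filesystem holding `path`, in GB (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


if __name__ == "__main__":
    try:
        for name, mem_mib in query_gpus():
            status = "ok" if mem_mib >= MIN_GPU_MEM_MIB else "below 12 GB"
            print(f"{name}: {mem_mib} MiB ({status})")
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("nvidia-smi not found or failed; is an NVIDIA driver installed?")
    print(f"free disk: {free_disk_gb():.1f} GB (need about {MIN_FREE_DISK_GB})")
```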
For running inference with your own GPU:
pip install olmocr[gpu] --extra-index-url https://download.pytorch.org/whl/cu128
# Recommended: Install flash infer for faster inference on GPU
pip install https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.5%2Bcu128torch2.7-cp38-abi3-linux_x86_64.whl
Option 3: Beaker Cluster Execution
For submitting jobs to Beaker clusters with the --beaker flag:
pip install olmocr[beaker]
Option 4: Benchmark Suite
For running the olmOCR benchmark suite:
pip install olmocr[bench]
Combined Installation
You can combine multiple options:
# GPU + Beaker support
pip install olmocr[gpu,beaker] --extra-index-url https://download.pytorch.org/whl/cu128
# GPU + Benchmark support
pip install olmocr[gpu,bench] --extra-index-url https://download.pytorch.org/whl/cu128
Troubleshooting
If you run into errors about too many open files, update your ulimit:
ulimit -n 65536
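To see what the limit currently is from Python, or to raise it for a single Python process and its children rather than the whole shell, the standard-library resource module can be used. This is a minimal sketch and not something olmOCR does itself; desired_soft_limit is a hypothetical helper, and the soft limit can only be raised up to the hard limit without extra privileges.

```python
import resource

TARGET = 65536  # same value as the ulimit command above


def desired_soft_limit(soft, hard, target=TARGET):
    """Soft limit to request: the target, capped by the hard limit,
    and never lower than the current soft limit."""
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited
    if hard == resource.RLIM_INFINITY:
        return max(soft, target)
    return max(soft, min(target, hard))


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"open-file limit before: soft={soft}, hard={hard}")
    try:
        new_soft = desired_soft_limit(soft, hard)
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    except (ValueError, OSError) as err:
        print(f"could not raise the limit: {err}")
    print("open-file limit now:", resource.getrlimit(resource.RLIMIT_NOFILE))
```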
For quick testing, try the web demo.
Convert a Single PDF (Local GPU):
# Download a sample PDF
curl -o olmocr-sample.pdf https://olmocr.allenai.org/papers/olmocr_3pg_sample.pdf
# Convert it to markdown
olmocr ./localworkspace --markdown --pdfs olmocr-sample.pdf
Convert an Image file:
olmocr ./localworkspace --markdown --pdfs random_page.png
Convert Multiple PDFs:
olmocr ./localworkspace --markdown --pdfs tests/gnarly_pdfs/*.pdf
Use Remote Inference Server:
olmocr ./localworkspace --server http://remote-server:8000/v1 --model allenai/olmOCR-2-7B-1025-FP8 --markdown --pdfs *.pdf
With the --markdown flag, results will be stored as markdown files inside of ./localworkspace/markdown/.
Note: You can also use python -m olmocr.pipeline instead of olmocr if you prefer.
The ./localworkspace/ workspace folder will then have both Dolma and markdown files (if using --markdown).
cat localworkspace/markdown/olmocr-sample.md
olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models
...
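To process the converted files programmatically instead of with cat, a small script can walk the markdown folder. This sketch assumes only what is stated above, namely that .md files end up under the workspace's markdown/ directory. The helper name is made up, and the recursive search is a precaution in case outputs are nested in subfolders.

```python
from pathlib import Path


def list_markdown_outputs(workspace):
    """Return the .md files under <workspace>/markdown, sorted by path."""
    md_dir = Path(workspace) / "markdown"
    if not md_dir.is_dir():
        return []
    return sorted(md_dir.rglob("*.md"))


if __name__ == "__main__":
    for md_file in list_markdown_outputs("./localworkspace"):
        lines = md_file.read_text(encoding="utf-8").splitlines()
        # Show each file with its first line as a quick sanity check.
        print(md_file, "->", lines[0] if lines else "(empty)")
```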
If you have a vLLM server already running elsewhere (or any inference platform implementing the OpenAI API), you can point olmOCR to use it instead of spawning a local instance.
Installation for Remote Inference:
# Lightweight installation - no GPU dependencies needed
pip install olmocr
Using an External Server:
# Use external vLLM server instead of local one
olmocr ./localworkspace --server http://remote-server:8000/v1 --model allenai/olmOCR-2-7B-1025-FP8 --markdown --pdfs tests/gnarly_pdfs/*.pdf
The served model name in vLLM needs to match the value provided in --model.
Example vLLM Server Launch:
vllm serve allenai/olmOCR-2-7B-1025-FP8 --max-model-len 16384
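Since the served model name must match --model, it can be worth listing what the server reports before starting a long run. The sketch below assumes the server implements the OpenAI-style list-models endpoint at <server>/models, which vLLM's OpenAI-compatible server does; other providers may not. The function names are illustrative and not part of olmOCR.

```python
import json
import urllib.request


def served_model_ids(models_response):
    """Extract model ids from an OpenAI-style 'list models' JSON payload."""
    return [entry["id"] for entry in models_response.get("data", [])]


def fetch_served_model_ids(server, api_key=None):
    """Query <server>/models, where `server` is the URL given to --server."""
    request = urllib.request.Request(server.rstrip("/") + "/models")
    if api_key:
        # Same bearer-token scheme that --api_key uses.
        request.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return served_model_ids(json.load(response))
```

For example, `"allenai/olmOCR-2-7B-1025-FP8" in fetch_served_model_ids("http://remote-server:8000/v1")` should be true for the server launched above; if it is not, pass whichever id the server reports as --model.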
We have tested olmOCR-2-7B-1025-FP8 on these external model providers and confirmed that they work:
| Provider | $/1M Input tokens | $/1M Output tokens | Example Command |
|---|---|---|---|
| Cirrascale | $0.07 | $0.15 | olmocr ./workspace --server https://ai2endpoints.cirrascale.ai/api --api_key sk-XXXXXXX --workers 1 --max_concurrent_requests 20 --model olmOCR-2-7B-1025 --pdfs tests/gnarly_pdfs/*.pdf |
| DeepInfra | $0.09 | $0.19 | olmocr ./workspace --server https://api.deepinfra.com/v1/openai --api_key DfXXXXXXX --workers 1 --max_concurrent_requests 20 --model allenai/olmOCR-2-7B-1025 --pdfs tests/gnarly_pdfs/*.pdf |
| Parasail | $0.10 | $0.20 | olmocr ./workspace --server https://api.parasail.io/v1 --api_key psk-XXXXX --workers 1 --max_concurrent_requests 20 --model allenai/olmOCR-2-7B-1025 --pdfs tests/gnarly_pdfs/*.pdf |
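The per-token prices can be turned into a rough job estimate once you know how many tokens a page consumes. The prices below are copied from the table; the tokens-per-page figures in the usage example are placeholders, not measurements, so substitute numbers from your own runs.

```python
# provider: (USD per 1M input tokens, USD per 1M output tokens), from the table
PRICES = {
    "Cirrascale": (0.07, 0.15),
    "DeepInfra": (0.09, 0.19),
    "Parasail": (0.10, 0.20),
}


def estimate_cost_usd(provider, pages, input_tokens_per_page, output_tokens_per_page):
    """Estimated cost = tokens / 1M * price, summed over input and output."""
    price_in, price_out = PRICES[provider]
    input_cost = pages * input_tokens_per_page / 1_000_000 * price_in
    output_cost = pages * output_tokens_per_page / 1_000_000 * price_out
    return input_cost + output_cost


if __name__ == "__main__":
    # Placeholder token counts per page (hypothetical, for illustration only).
    for provider in PRICES:
        cost = estimate_cost_usd(provider, pages=10_000,
                                 input_tokens_per_page=1_500,
                                 output_tokens_per_page=500)
        print(f"{provider}: ${cost:.2f} for 10,000 pages")
```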
Notes on arguments
- --server: The OpenAI-compatible endpoint, e.g. https://api.deepinfra.com/v1/openai
- --api_key: Your API key, passed in via the Authorization Bearer HTTP header
- --max_concurrent_requests: Max concurrent requests that will be in flight to the inference provider at one time
- --workers: Max number of page groups that will be processed at once. You may want to set this to 1 so that one group finishes before the next one starts.
- --pages_per_group: You may want a smaller number of pages per group, as many external providers have lower concurrent request limits
- --model: The model identifier, e.g. allenai/olmOCR-2-7B-1025. Different providers use different names; if you run locally, you can use olmocr.
- Other arguments work the same as with local inference
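When launching olmOCR from a script, the arguments above can be assembled into an argument list for subprocess. The builder below is a hypothetical convenience wrapper, not an olmOCR API; it uses only flags that appear in this document and leaves out any that are not set.

```python
import shlex
import subprocess


def build_olmocr_command(workspace, pdfs, server=None, api_key=None, model=None,
                         workers=None, max_concurrent_requests=None,
                         pages_per_group=None, markdown=False):
    """Build the argv list for an olmocr run from the flags described above."""
    cmd = ["olmocr", workspace]
    valued_flags = [
        ("--server", server),
        ("--api_key", api_key),
        ("--model", model),
        ("--workers", workers),
        ("--max_concurrent_requests", max_concurrent_requests),
        ("--pages_per_group", pages_per_group),
    ]
    for flag, value in valued_flags:
        if value is not None:
            cmd += [flag, str(value)]
    if markdown:
        cmd.append("--markdown")
    cmd += ["--pdfs", *pdfs]
    return cmd


if __name__ == "__main__":
    cmd = build_olmocr_command(
        "./workspace", ["olmocr-sample.pdf"],
        server="https://api.deepinfra.com/v1/openai",
        model="allenai/olmOCR-2-7B-1025",
        workers=1, max_concurrent_requests=20, markdown=True,
    )
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Passing a list to subprocess bypasses the shell, so patterns such as tests/gnarly_pdfs/*.pdf are not expanded by the shell; expanding them first (for example with glob.glob) and passing explicit paths is the safer choice.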
If you want to convert millions of PDFs using multiple nodes running in parallel, olmOCR supports reading PDFs from AWS S3 and coordinating work using an AWS S3 output bucket.
Start the first worker node:
olmocr s3://my_s3_bucket/pdfworkspaces/exampleworkspace --pdfs s3://my_s3_bucket/jakep/gnarly_pdfs/*.pdf
This sets up a simple work queue in your AWS bucket and starts converting PDFs.
On subsequent worker nodes:
olmocr s3://my_s3_bucket/pdfworkspaces/exampleworkspace
They will automatically start grabbing items from the same workspace queue.
If you are at Ai2 and want to linearize millions of PDFs efficiently using beaker,
$ claude mcp add olmocr \
-- python -m otcore.mcp_server <graph>