
github.com/Yuliang-Liu/MonkeyOCR @main


558 symbols · 1,708 edges · 94 files · 198 documented (35%) · updated 58d ago · ★ 6,597

README

MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm


Zhang Li, Yuliang Liu, Qiang Liu, Zhiyin Ma, Ziyang Zhang, Shuo Zhang, Zidun Guo, Jiarui Zhang, Xinyu Wang, Xiang Bai

MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns

Jiarui Zhang, Yuliang Liu, Zijun Wu, Guosheng Pang, Zhili Ye, Yupei Zhong, Junteng Ma, Tao Wei, Haiyang Xu, Weikai Chen, Zeen Wang, Qiangjun Ji, Fanxi Zhou, Qi Zhang, Yuanrui Hu, Jiahao Liu, Zhang Li, Ziyang Zhang, Qiang Liu, Xiang Bai

Multimodal OCR: Parse Anything from Documents

Handong Zheng, Yumeng Li, Kaile Zhang, Liang Xin, Guangwei Zhao, Hao Liu, Jiayu Chen, Jie Lou, Qi Fu, Rui Yang, Shuo Jiang, Weijian Luo, Weijie Su, Weijun Zhang, Xingyu Zhu, Yabin Li, Yiwei Ma, Yu Chen, Yuqiu Ji, Zhaohui Yu, Guang Yang, Colin Zhang, Lei Zhang, Yuliang Liu, Xiang Bai

News

2026.04.01 🚀 dots.mocr achieves the best open-source score on MDPBench, a 17-language document parsing benchmark!
2026.03.15 🚀 We release dots.mocr Multimodal OCR: Parse Anything from Documents.
2026.01.30 🚀 We release MonkeyDoc and provide the necessary details of our data generation pipeline.
2025.11.14 🚀 We release MonkeyOCR-v1.5 Technical Report, achieving the best document parsing performance to date. Demo.
2025.07.10 🚀 We release MonkeyOCR-pro-1.2B, a leaner and faster model that outperforms our previous 3B version in accuracy, speed, and efficiency.
2025.06.12 🚀 The model reaches #2 on Hugging Face Trending.
2025.06.05 🚀 We release MonkeyOCR, an English and Chinese document parsing model.

Introduction

MonkeyOCR adopts a Structure-Recognition-Relation (SRR) triplet paradigm, which simplifies the multi-tool pipeline of modular approaches while avoiding the inefficiency of using large multimodal models for full-page document processing.
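As a rough illustration of the triplet idea, the sketch below wires three toy stages together. It assumes that "structure" means layout detection, "recognition" means per-block content recognition, and "relation" means reading-order prediction. All function names, the `Block` type, and the page dictionary are hypothetical stand-ins for this example, not MonkeyOCR's actual API or models.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class Block:
    bbox: BBox
    kind: str          # e.g. "text", "formula", "table"
    content: str = ""


def detect_structure(page: Dict) -> List[Block]:
    """Structure stage: hypothetical stand-in for a layout detector."""
    return [Block(r["bbox"], r["kind"]) for r in page["regions"]]


def recognize(block: Block, page: Dict) -> Block:
    """Recognition stage: hypothetical stand-in for per-block recognition."""
    block.content = page["text"][block.bbox]
    return block


def order_blocks(blocks: List[Block]) -> List[Block]:
    """Relation stage: toy reading order (top to bottom, then left to right)."""
    return sorted(blocks, key=lambda b: (b.bbox[1], b.bbox[0]))


def parse_page(page: Dict) -> str:
    blocks = [recognize(b, page) for b in detect_structure(page)]
    return "\n\n".join(b.content for b in order_blocks(blocks))


# Toy page: regions are listed out of reading order on purpose.
page = {
    "regions": [
        {"bbox": (50, 200, 500, 400), "kind": "text"},
        {"bbox": (50, 50, 500, 100), "kind": "text"},
    ],
    "text": {
        (50, 200, 500, 400): "Body paragraph.",
        (50, 50, 500, 100): "Title",
    },
}
print(parse_page(page))
```

The point of the sketch is only the division of labor: each block is recognized on its own rather than passing the full page through one large model, and ordering is a separate step.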

MonkeyOCR-pro-1.2B surpasses MonkeyOCR-3B by 7.4% on Chinese documents.
MonkeyOCR-pro-1.2B delivers approximately a 36% speed improvement over MonkeyOCR-pro-3B, with an approximately 1.6% drop in performance.
On olmOCR-Bench, MonkeyOCR-pro-1.2B outperforms Nanonets-OCR-3B by 7.3%.
On OmniDocBench, MonkeyOCR-pro-3B achieves the best overall performance on both English and Chinese documents, outperforming even closed-source and extra-large open-source VLMs such as Gemini 2.0-Flash, Gemini 2.5-Pro, Qwen2.5-VL-72B, GPT-4o, and InternVL3-78B.

See detailed results below.

Comparing MonkeyOCR with closed-source and extra-large open-source VLMs.

Inference Speed (Pages/s) on Different GPUs and PDF Page Counts

Model	GPU	50 Pages	100 Pages	300 Pages	500 Pages	1000 Pages
MonkeyOCR-pro-3B	3090	0.492	0.484	0.497	0.492	0.496
MonkeyOCR-pro-3B	A6000	0.585	0.587	0.609	0.598	0.608
MonkeyOCR-pro-3B	H800	0.923	0.768	0.897	0.930	0.891
MonkeyOCR-pro-3B	4090	0.972	0.969	1.006	0.986	1.006
MonkeyOCR-pro-1.2B	3090	0.615	0.660	0.677	0.687	0.683
MonkeyOCR-pro-1.2B	A6000	0.709	0.786	0.825	0.829	0.825
MonkeyOCR-pro-1.2B	H800	0.965	1.082	1.101	1.145	1.015
MonkeyOCR-pro-1.2B	4090	1.194	1.314	1.436	1.442	1.434
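To read the table in practical terms, the short sketch below computes the per-GPU speedup of MonkeyOCR-pro-1.2B over MonkeyOCR-pro-3B and the approximate wall-clock time for a 1000-page job. The numbers are copied from the 1000-page column above; the helper names are made up for this example.

```python
# Pages/s at 1000 pages, copied from the inference speed table.
PRO_3B = {"3090": 0.496, "A6000": 0.608, "H800": 0.891, "4090": 1.006}
PRO_1_2B = {"3090": 0.683, "A6000": 0.825, "H800": 1.015, "4090": 1.434}


def speedup(fast, slow):
    """Throughput ratio fast/slow for every GPU present in both tables."""
    return {gpu: fast[gpu] / slow[gpu] for gpu in slow if gpu in fast}


def minutes_for(pages, pages_per_second):
    """Approximate wall-clock minutes to parse `pages` at a given throughput."""
    return pages / pages_per_second / 60


ratios = speedup(PRO_1_2B, PRO_3B)
for gpu, ratio in ratios.items():
    eta = minutes_for(1000, PRO_1_2B[gpu])
    print(f"{gpu}: {ratio:.2f}x faster, 1000 pages in about {eta:.1f} min")
```

These are throughput figures for the listed page counts only; actual times will vary with document content and hardware setup.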

VLM OCR Speed (Pages/s) on Different GPUs and PDF Page Counts

Model	GPU	50 Pages	100 Pages	300 Pages	500 Pages	1000 Pages
MonkeyOCR-pro-3B	3090	0.705	0.680	0.711	0.700	0.724
MonkeyOCR-pro-3B	A6000	0.885	0.860	0.915	0.892	0.934
MonkeyOCR-pro-3B	H800	1.371	1.135	1.339	1.433	1.509
MonkeyOCR-pro-3B	4090	1.321	1.300	1.384	1.343	1.410
MonkeyOCR-pro-1.2B	3090	0.919	1.086	1.166	1.182	1.199
MonkeyOCR-pro-1.2B	A6000	1.177	1.361	1.506	1.525	1.569
MonkeyOCR-pro-1.2B	H800	1.466	1.719	1.763	1.875	1.650
MonkeyOCR-pro-1.2B	4090	1.759	1.987	2.260	2.345	2.415

Supported Hardware

Because we have access to only a limited range of GPUs, we cannot provide precise hardware requirements. We have tested the model on the 3090, 4090, A6000, H800, A100, and even a 4060 with 8 GB of VRAM (suitable for deploying the quantized 3B model and the 1.2B model). We are very grateful for the feedback and contributions from the open-source community, whose members have also run the model successfully on 50-series GPUs, H200, L20, V100, 2080 Ti, and NPUs.

Quick Start

Local Installation

1. Install MonkeyOCR

See the installation guide to set up your environment.

2. Download Model Weights

Download our model from Hugging Face.

pip install huggingface_hub

python tools/download_model.py -n MonkeyOCR-pro-3B # or MonkeyOCR-pro-1.2B, MonkeyOCR

You can also download our model from ModelScope.

pip install modelscope

python tools/download_model.py -t modelscope -n MonkeyOCR-pro-3B  # or MonkeyOCR-pro-1.2B, MonkeyOCR
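If you script the download step, a small wrapper like the one below can assemble the same commands. It uses only the `-n` and `-t modelscope` flags and the three model names shown above; the function names and the `source` parameter are assumptions for this sketch, and it must be run from the repository root so that `tools/download_model.py` resolves.

```python
import subprocess

# Model names listed in the download commands above.
MODELS = ("MonkeyOCR-pro-3B", "MonkeyOCR-pro-1.2B", "MonkeyOCR")


def build_download_command(name="MonkeyOCR-pro-3B", source="huggingface"):
    """Return the argv list for tools/download_model.py (hypothetical helper)."""
    if name not in MODELS:
        raise ValueError(f"unknown model {name!r}; expected one of {MODELS}")
    cmd = ["python", "tools/download_model.py", "-n", name]
    if source == "modelscope":
        cmd += ["-t", "modelscope"]   # switch the download source to ModelScope
    elif source != "huggingface":
        raise ValueError("source must be 'huggingface' or 'modelscope'")
    return cmd


def download(name="MonkeyOCR-pro-3B", source="huggingface"):
    """Run the download script; raises CalledProcessError on failure."""
    subprocess.run(build_download_command(name, source), check=True)


print(" ".join(build_download_command("MonkeyOCR-pro-1.2B", "modelscope")))
```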

3. Inference

You can parse a file or a directory containing PDFs or images using the following commands:

# Replace input_path with the path to a PDF or image or directory

# End-to-end parsing
python parse.py input_path

# Parse files in a directory, grouping them by a maximum total page count per group
python parse.py input_path -g 20

# Single-task recognition (outputs markdown only)
python parse.py input_path -t text  # or formula, table

# Parse PDFs in input_path and split results by pages
python parse.py input_path -s

# Specify output directory and model config file
python parse.py input_path -o ./output -c config.yaml
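When calling `parse.py` from another script, it can help to build the argument list in one place. The sketch below covers only the flags shown above (`-g`, `-t`, `-s`, `-o`, `-c`); the function and its parameter names are hypothetical, and the resulting command is meant to be passed to `subprocess.run` from the repository root.

```python
import shlex

TASKS = ("text", "formula", "table")


def build_parse_command(input_path, group_pages=None, task=None,
                        split_pages=False, output_dir=None, config=None):
    """Return the argv list for parse.py (hypothetical helper)."""
    cmd = ["python", "parse.py", str(input_path)]
    if group_pages is not None:
        if group_pages < 1:
            raise ValueError("group_pages must be a positive integer")
        cmd += ["-g", str(group_pages)]
    if task is not None:
        if task not in TASKS:
            raise ValueError(f"task must be one of {TASKS}")
        cmd += ["-t", task]
    if split_pages:
        cmd.append("-s")
    if output_dir is not None:
        cmd += ["-o", str(output_dir)]
    if config is not None:
        cmd += ["-c", str(config)]
    return cmd


cmd = build_parse_command("/path/to/folder", group_pages=15,
                          split_pages=True, output_dir="./out")
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

Validating the task name before launching avoids starting a long model load only to fail on a typo.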

More usage examples

# Single file processing
python parse.py input.pdf                           # Parse single PDF file
python parse.py input.pdf -o ./output               # Parse with custom output dir
python parse.py input.pdf -s                        # Parse PDF with page splitting
python parse.py image.jpg                           # Parse single image file

# Single task recognition
python parse.py image.jpg -t text                   # Text recognition from image
python parse.py image.jpg -t formula                # Formula recognition from image
python parse.py image.jpg -t table                  # Table recognition from image
python parse.py document.pdf -t text                # Text recognition from all PDF pages

# Folder processing (all files individually)
python parse.py /path/to/folder                     # Parse all files in folder
python parse.py /path/to/folder -s                  # Parse with page splitting
python parse.py /path/to/folder -t text             # Single task recognition for all files

# Multi-file grouping (batch processing by page count)
python parse.py /path/to/folder -g 5                # Group files with max 5 total pages
python parse.py /path/to/folder -g 10 -s            # Group files with page splitting
python parse.py /path/to/folder -g 8 -t text        # Group files for single task recognition

# Advanced configurations
python parse.py input.pdf -c model_configs.yaml     # Custom model configuration
python parse.py /path/to/folder -g 15 -s -o ./out   # Group files, split pages, custom output
python parse.py input.pdf --pred-abandon            # Enable predicting abandon elements
python parse.py /path/to/folder -g 10 -m            # Group files and merge text blocks in output
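To make the idea behind `-g` concrete, here is one way files could be batched under a page budget. This is an illustrative greedy grouping, not necessarily how `parse.py` implements it; in particular, the handling of a file that alone exceeds the limit (it gets its own group) is an assumption.

```python
def group_by_pages(page_counts, max_pages):
    """Greedily pack (name, pages) pairs into groups of at most max_pages.

    Files are kept in their original order. A single file larger than
    max_pages is placed in a group by itself (an assumption of this sketch).
    """
    groups, current, total = [], [], 0
    for name, pages in page_counts:
        # Close the current group if this file would push it over the budget.
        if current and total + pages > max_pages:
            groups.append(current)
            current, total = [], 0
        current.append(name)
        total += pages
    if current:
        groups.append(current)
    return groups


files = [("a.pdf", 2), ("b.pdf", 3), ("c.jpg", 1), ("d.pdf", 7), ("e.pdf", 4)]
print(group_by_pages(files, 5))
# [['a.pdf', 'b.pdf'], ['c.jpg'], ['d.pdf'], ['e.pdf']]
```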

Output Results

MonkeyOCR mainly generates three types of output files:

Processed Markdown File (your.md): The final parsed document content in markdown format, containing text, formulas, tables, and other structured elements.
Layout Results (your_layout.pdf): The layout results drawn on the original PDF.
Intermediate Block Results (your_middle.json): A JSON file containing detailed information about all detected blocks, including:
Block coordinates and positions
Block content and type information
Relationship information between blo
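A small helper can locate these three files for a given input and take a first look at the intermediate JSON. The file-name suffixes follow the list above; whether the files sit directly in the output directory or in a per-document subfolder is an assumption here, and the JSON schema is not assumed at all (the sketch only reports the top-level shape).

```python
import json
from pathlib import Path


def expected_outputs(input_file, output_dir="./output"):
    """Map an input file to the three output paths (directory layout assumed)."""
    stem = Path(input_file).stem
    out = Path(output_dir)
    return {
        "markdown": out / f"{stem}.md",
        "layout": out / f"{stem}_layout.pdf",
        "middle": out / f"{stem}_middle.json",
    }


def summarize_middle(path):
    """Describe the top-level shape of a *_middle.json file without assuming its schema."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        return "object with keys: " + ", ".join(sorted(data))
    if isinstance(data, list):
        return f"array with {len(data)} items"
    return f"scalar of type {type(data).__name__}"


paths = expected_outputs("docs/report.pdf", "./output")
for kind, path in paths.items():
    print(f"{kind}: {path} (exists: {path.exists()})")
```

Once a real `*_middle.json` is available, `summarize_middle(paths["middle"])` shows which keys to explore next.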

Core symbols (most depended-on inside this repo)

Symbol	Called by	File
read	28	magic_pdf/data/io/s3.py
draw_bbox_without_number	18	magic_pdf/libs/draw_bbox.py
merge_para_with_text	14	magic_pdf/dict2md/ocr_mkcontent.py
draw_bbox_with_number	12	magic_pdf/libs/draw_bbox.py
write	11	magic_pdf/data/io/s3.py
add_bboxes	10	magic_pdf/pre_proc/ocr_detect_all_bboxes.py
bbox_distance	9	magic_pdf/libs/boxbase.py
dump_md	8	magic_pdf/operators/pipes_llm.py

Shape

Function 254

Method 231

Class 66

Route 7

Languages

Python 100%

Modules by API surface

magic_pdf/data/dataset.py · 60 symbols

magic_pdf/model/custom_model.py · 50 symbols

api/main.py · 39 symbols

magic_pdf/libs/boxbase.py · 27 symbols

magic_pdf/pdf_parse_union_core_v2_llm.py · 26 symbols

magic_pdf/model/magic_model.py · 24 symbols

magic_pdf/config/exceptions.py · 15 symbols

demo/demo_gradio.py · 15 symbols

magic_pdf/operators/pipes_llm.py · 13 symbols

magic_pdf/filter/pdf_meta_scan.py · 13 symbols

tools/lmdeploy_patcher.py · 12 symbols

magic_pdf/model/sub_modules/reading_oreder/layoutreader/helpers.py · 10 symbols

Dependencies (from manifests, versioned)

Brotli 1.1.0 · 1×

PyMuPDF 1.24.9 · 1×

boto3 1.28.43 · 1×

click 8.1.7 · 1×

dill 0.3.8 · 1×

doclayout_yolo 0.0.2b1 · 1×

fast-langdetect 0.2.3 · 1×

fastapi 0.104.1 · 1×

gradio 5.23.3 · 1×

loguru 0.6.0 · 1×

numpy 1.21.6 · 1×

openai 2.6.1 · 1×

For agents

$ claude mcp add MonkeyOCR \
  -- python -m otcore.mcp_server <graph>
