
github.com/Yuliang-Liu/MonkeyOCR @main

558 symbols · 1,708 edges · 94 files · 198 documented · 35% · updated 58d ago · ★ 6,597
README

MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm


Zhang Li, Yuliang Liu, Qiang Liu, Zhiyin Ma, Ziyang Zhang, Shuo Zhang, Zidun Guo, Jiarui Zhang, Xinyu Wang, Xiang Bai


MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns

Jiarui Zhang, Yuliang Liu, Zijun Wu, Guosheng Pang, Zhili Ye, Yupei Zhong, Junteng Ma, Tao Wei, Haiyang Xu, Weikai Chen, Zeen Wang, Qiangjun Ji, Fanxi Zhou, Qi Zhang, Yuanrui Hu, Jiahao Liu, Zhang Li, Ziyang Zhang, Qiang Liu, Xiang Bai


Multimodal OCR: Parse Anything from Documents

Handong Zheng, Yumeng Li, Kaile Zhang, Liang Xin, Guangwei Zhao, Hao Liu, Jiayu Chen, Jie Lou, Qi Fu, Rui Yang, Shuo Jiang, Weijian Luo, Weijie Su, Weijun Zhang, Xingyu Zhu, Yabin Li, Yiwei Ma, Yu Chen, Yuqiu Ji, Zhaohui Yu, Guang Yang, Colin Zhang, Lei Zhang, Yuliang Liu, Xiang Bai


News

  • 2026.04.01 🚀 dots.mocr achieves the best open-source score on MDPBench, a 17-language document parsing benchmark!
  • 2026.03.15 🚀 We release dots.mocr Multimodal OCR: Parse Anything from Documents.
  • 2026.01.30 🚀 We release MonkeyDoc and provide the necessary details of our data generation pipeline.
  • 2025.11.14 🚀 We release the MonkeyOCR-v1.5 Technical Report, achieving the best document parsing performance to date. Demo.
  • 2025.07.10 🚀 We release MonkeyOCR-pro-1.2B, a leaner and faster model that outperforms our previous 3B version in accuracy, speed, and efficiency.
  • 2025.06.12 🚀 The model reaches #2 on Hugging Face trending.
  • 2025.06.05 🚀 We release MonkeyOCR, a document parsing model for English and Chinese.

Introduction

MonkeyOCR adopts a Structure-Recognition-Relation (SRR) triplet paradigm, which simplifies the multi-tool pipeline of modular approaches while avoiding the inefficiency of using large multimodal models for full-page document processing.

  1. MonkeyOCR-pro-1.2B surpasses MonkeyOCR-3B by 7.4% on Chinese documents.
  2. MonkeyOCR-pro-1.2B delivers approximately a 36% speed improvement over MonkeyOCR-pro-3B, with an approximately 1.6% drop in performance.
  3. On olmOCR-Bench, MonkeyOCR-pro-1.2B outperforms Nanonets-OCR-3B by 7.3%.
  4. On OmniDocBench, MonkeyOCR-pro-3B achieves the best overall performance on both English and Chinese documents, outperforming even closed-source and extra-large open-source VLMs such as Gemini 2.0-Flash, Gemini 2.5-Pro, Qwen2.5-VL-72B, GPT-4o, and InternVL3-78B.

See detailed results below.

Comparison of MonkeyOCR with closed-source and extra-large open-source VLMs.

Inference Speed (Pages/s) on Different GPUs and PDF Page Counts

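| Model | GPU | 50 Pages | 100 Pages | 300 Pages | 500 Pages | 1000 Pages |
| --- | --- | --- | --- | --- | --- | --- |
| MonkeyOCR-pro-3B | 3090 | 0.492 | 0.484 | 0.497 | 0.492 | 0.496 |
| MonkeyOCR-pro-3B | A6000 | 0.585 | 0.587 | 0.609 | 0.598 | 0.608 |
| MonkeyOCR-pro-3B | H800 | 0.923 | 0.768 | 0.897 | 0.930 | 0.891 |
| MonkeyOCR-pro-3B | 4090 | 0.972 | 0.969 | 1.006 | 0.986 | 1.006 |
| MonkeyOCR-pro-1.2B | 3090 | 0.615 | 0.660 | 0.677 | 0.687 | 0.683 |
| MonkeyOCR-pro-1.2B | A6000 | 0.709 | 0.786 | 0.825 | 0.829 | 0.825 |
| MonkeyOCR-pro-1.2B | H800 | 0.965 | 1.082 | 1.101 | 1.145 | 1.015 |
| MonkeyOCR-pro-1.2B | 4090 | 1.194 | 1.314 | 1.436 | 1.442 | 1.434 |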
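The inference speed figures above can be turned into per-GPU speedups and rough processing-time estimates. The sketch below copies the pages/s values from the table and averages them over the five page counts; the helper names (`speedup_by_gpu`, `estimated_seconds`) are illustrative and not part of the repository, and averaging across page counts is an assumption about how to summarize the table, not necessarily how the reported speed improvement was computed.

```python
# Pages/s copied from the inference speed table, in the column order
# 50, 100, 300, 500, 1000 pages.
INFERENCE_SPEED = {
    "MonkeyOCR-pro-3B": {
        "3090": [0.492, 0.484, 0.497, 0.492, 0.496],
        "A6000": [0.585, 0.587, 0.609, 0.598, 0.608],
        "H800": [0.923, 0.768, 0.897, 0.930, 0.891],
        "4090": [0.972, 0.969, 1.006, 0.986, 1.006],
    },
    "MonkeyOCR-pro-1.2B": {
        "3090": [0.615, 0.660, 0.677, 0.687, 0.683],
        "A6000": [0.709, 0.786, 0.825, 0.829, 0.825],
        "H800": [0.965, 1.082, 1.101, 1.145, 1.015],
        "4090": [1.194, 1.314, 1.436, 1.442, 1.434],
    },
}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def speedup_by_gpu(table, fast, slow):
    """Relative throughput gain of `fast` over `slow` for each GPU.

    Assumption: throughput is averaged over all page counts before comparing.
    """
    return {
        gpu: mean(table[fast][gpu]) / mean(table[slow][gpu]) - 1.0
        for gpu in table[fast]
    }


def estimated_seconds(pages, pages_per_second):
    """Rough wall-clock estimate for parsing `pages` pages."""
    if pages_per_second <= 0:
        raise ValueError("pages_per_second must be positive")
    return pages / pages_per_second


if __name__ == "__main__":
    gains = speedup_by_gpu(INFERENCE_SPEED, "MonkeyOCR-pro-1.2B", "MonkeyOCR-pro-3B")
    for gpu, gain in gains.items():
        print(f"{gpu}: {gain:+.1%}")
```

Because the measured throughput stays roughly flat from 50 to 1000 pages, dividing a page count by the tabulated pages/s gives a reasonable first-order time estimate for a given GPU.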

VLM OCR Speed (Pages/s) on Different GPUs and PDF Page Counts

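| Model | GPU | 50 Pages | 100 Pages | 300 Pages | 500 Pages | 1000 Pages |
| --- | --- | --- | --- | --- | --- | --- |
| MonkeyOCR-pro-3B | 3090 | 0.705 | 0.680 | 0.711 | 0.700 | 0.724 |
| MonkeyOCR-pro-3B | A6000 | 0.885 | 0.860 | 0.915 | 0.892 | 0.934 |
| MonkeyOCR-pro-3B | H800 | 1.371 | 1.135 | 1.339 | 1.433 | 1.509 |
| MonkeyOCR-pro-3B | 4090 | 1.321 | 1.300 | 1.384 | 1.343 | 1.410 |
| MonkeyOCR-pro-1.2B | 3090 | 0.919 | 1.086 | 1.166 | 1.182 | 1.199 |
| MonkeyOCR-pro-1.2B | A6000 | 1.177 | 1.361 | 1.506 | 1.525 | 1.569 |
| MonkeyOCR-pro-1.2B | H800 | 1.466 | 1.719 | 1.763 | 1.875 | 1.650 |
| MonkeyOCR-pro-1.2B | 4090 | 1.759 | 1.987 | 2.260 | 2.345 | 2.415 |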
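Comparing the two speed tables gives a rough sense of how much of the end-to-end time goes to the VLM OCR stage. The sketch below rests on two assumptions that the README does not state: that both tables come from the same runs, and that the VLM OCR speed is the page count divided by the time spent in the VLM OCR stage alone, with stages running sequentially. The function name `vlm_time_share` is illustrative.

```python
def vlm_time_share(end_to_end_pps, vlm_pps):
    """Fraction of per-page wall time attributed to the VLM OCR stage.

    Assumption: stages run one after another, so per-page time is 1 / speed
    and the share is (1 / vlm_pps) / (1 / end_to_end_pps).
    """
    if end_to_end_pps <= 0 or vlm_pps <= 0:
        raise ValueError("speeds must be positive")
    return end_to_end_pps / vlm_pps


if __name__ == "__main__":
    # 1000-page column, values copied from the two tables.
    cases = {
        "MonkeyOCR-pro-3B on 3090": (0.496, 0.724),
        "MonkeyOCR-pro-1.2B on 4090": (1.434, 2.415),
    }
    for name, (end_to_end, vlm) in cases.items():
        print(f"{name}: {vlm_time_share(end_to_end, vlm):.1%} of time in VLM OCR")
```

Under these assumptions, a lower share for the smaller model would mean that the non-VLM stages account for a larger part of its runtime.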

Supported Hardware

Because we have access to only a limited range of GPUs, we cannot provide precise hardware requirements. We have tested the model on the 3090, 4090, A6000, H800, and A100, and even on the 4060 with 8GB of VRAM (suitable for deploying the quantized 3B model and the 1.2B model). We are very grateful for the feedback and contributions from the open-source community, whose members have also successfully run the model on 50-series GPUs, H200, L20, V100, 2080 Ti, and NPUs.

Quick Start

Local Installation

1. Install MonkeyOCR

See the installation guide to set up your environment.

2. Download Model Weights

Download our model from Hugging Face.

pip install huggingface_hub

python tools/download_model.py -n MonkeyOCR-pro-3B # or MonkeyOCR-pro-1.2B, MonkeyOCR

You can also download our model from ModelScope.

pip install modelscope

python tools/download_model.py -t modelscope -n MonkeyOCR-pro-3B  # or MonkeyOCR-pro-1.2B, MonkeyOCR

3. Inference

You can parse a file or a directory containing PDFs or images using the following commands:

# Replace input_path with the path to a PDF or image or directory

# End-to-end parsing
python parse.py input_path

# Parse files in a directory, grouping them so that each group has at most 20 pages in total
python parse.py input_path -g 20

# Single-task recognition (outputs markdown only); pass one of text, formula, or table to -t
python parse.py input_path -t text/formula/table

# Parse PDFs in input_path and split results by pages
python parse.py input_path -s

# Specify output directory and model config file
python parse.py input_path -o ./output -c config.yaml

More usage examples

# Single file processing
python parse.py input.pdf                           # Parse single PDF file
python parse.py input.pdf -o ./output               # Parse with custom output dir
python parse.py input.pdf -s                        # Parse PDF with page splitting
python parse.py image.jpg                           # Parse single image file

# Single task recognition
python parse.py image.jpg -t text                   # Text recognition from image
python parse.py image.jpg -t formula                # Formula recognition from image
python parse.py image.jpg -t table                  # Table recognition from image
python parse.py document.pdf -t text                # Text recognition from all PDF pages

# Folder processing (all files individually)
python parse.py /path/to/folder                     # Parse all files in folder
python parse.py /path/to/folder -s                  # Parse with page splitting
python parse.py /path/to/folder -t text             # Single task recognition for all files

# Multi-file grouping (batch processing by page count)
python parse.py /path/to/folder -g 5                # Group files with max 5 total pages
python parse.py /path/to/folder -g 10 -s            # Group files with page splitting
python parse.py /path/to/folder -g 8 -t text        # Group files for single task recognition

# Advanced configurations
python parse.py input.pdf -c model_configs.yaml     # Custom model configuration
python parse.py /path/to/folder -g 15 -s -o ./out   # Group files, split pages, custom output
python parse.py input.pdf --pred-abandon            # Enable predicting abandon elements
python parse.py /path/to/folder -g 10 -m            # Group files and merge text blocks in output
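When parse.py is driven from another script, the flags shown above can be assembled programmatically. The following is a minimal sketch, assuming parse.py is invoked from the repository root; `build_parse_command` is a hypothetical helper, not part of MonkeyOCR, and it uses only the flags that appear in the examples above (`-o`, `-c`, `-g`, `-t`, `-s`, `--pred-abandon`, `-m`). Which flag combinations the CLI accepts beyond those shown is not verified here.

```python
from typing import List, Optional

# Task names taken from the usage examples above.
VALID_TASKS = {"text", "formula", "table"}


def build_parse_command(
    input_path: str,
    output_dir: Optional[str] = None,
    config: Optional[str] = None,
    group_pages: Optional[int] = None,
    task: Optional[str] = None,
    split_pages: bool = False,
    pred_abandon: bool = False,
    merge_blocks: bool = False,
    script: str = "parse.py",  # assumption: run from the repository root
) -> List[str]:
    """Return an argv list for parse.py (hypothetical helper)."""
    cmd = ["python", script, input_path]
    if output_dir is not None:
        cmd += ["-o", output_dir]
    if config is not None:
        cmd += ["-c", config]
    if group_pages is not None:
        if group_pages <= 0:
            raise ValueError("group_pages must be a positive integer")
        cmd += ["-g", str(group_pages)]
    if task is not None:
        if task not in VALID_TASKS:
            raise ValueError(f"task must be one of {sorted(VALID_TASKS)}")
        cmd += ["-t", task]
    if split_pages:
        cmd.append("-s")
    if pred_abandon:
        cmd.append("--pred-abandon")
    if merge_blocks:
        cmd.append("-m")
    return cmd


if __name__ == "__main__":
    # Equivalent to: python parse.py /path/to/folder -o ./out -g 15 -s
    print(" ".join(build_parse_command("/path/to/folder", output_dir="./out",
                                       group_pages=15, split_pages=True)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; building a list rather than a shell string avoids quoting problems with paths that contain spaces.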

Output Results

MonkeyOCR mainly generates three types of output files:

  1. Processed Markdown File (your.md): The final parsed document content in markdown format, containing text, formulas, tables, and other structured elements.
  2. Layout Results (your_layout.pdf): The layout results drawn on the original PDF.
  3. Intermediate Block Results (your_middle.json): A JSON file containing detailed information about all detected blocks, including:
     • Block coordinates and positions
     • Block content and type information
     • Relationship information between blocks
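The three output files listed above can be located and sanity-checked after a run. This is a minimal sketch that assumes the files sit directly in one directory and follow the naming pattern given above (`your.md`, `your_layout.pdf`, `your_middle.json`); the actual directory layout may differ, and the helper names are hypothetical. It makes no assumption about the JSON schema and only reports the top-level type.

```python
import json
from pathlib import Path
from typing import Dict


def expected_outputs(output_dir: str, stem: str) -> Dict[str, Path]:
    """Paths of the three output files for a document named `stem`.

    Assumption: all three files live directly under `output_dir`.
    """
    base = Path(output_dir)
    return {
        "markdown": base / f"{stem}.md",
        "layout": base / f"{stem}_layout.pdf",
        "middle": base / f"{stem}_middle.json",
    }


def summarize_outputs(output_dir: str, stem: str) -> Dict[str, object]:
    """Report which outputs exist and the top-level JSON type of the middle file."""
    paths = expected_outputs(output_dir, stem)
    summary: Dict[str, object] = {name: path.exists() for name, path in paths.items()}
    if paths["middle"].exists():
        with paths["middle"].open(encoding="utf-8") as fh:
            data = json.load(fh)
        # The schema is not documented here, so only the container type is recorded.
        summary["middle_top_level_type"] = type(data).__name__
    return summary


if __name__ == "__main__":
    print(summarize_outputs("./output", "your"))
```

A check like this is useful in batch jobs, where a missing `_middle.json` file is an early sign that a document failed to parse.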

Core symbols most depended-on inside this repo

read · called by 28 · magic_pdf/data/io/s3.py
draw_bbox_without_number · called by 18 · magic_pdf/libs/draw_bbox.py
merge_para_with_text · called by 14 · magic_pdf/dict2md/ocr_mkcontent.py
draw_bbox_with_number · called by 12 · magic_pdf/libs/draw_bbox.py
write · called by 11 · magic_pdf/data/io/s3.py
add_bboxes · called by 10 · magic_pdf/pre_proc/ocr_detect_all_bboxes.py
bbox_distance · called by 9 · magic_pdf/libs/boxbase.py
dump_md · called by 8 · magic_pdf/operators/pipes_llm.py

Shape

Function 254
Method 231
Class 66
Route 7

Languages

Python 100%

Modules by API surface

magic_pdf/data/dataset.py · 60 symbols
magic_pdf/model/custom_model.py · 50 symbols
api/main.py · 39 symbols
magic_pdf/libs/boxbase.py · 27 symbols
magic_pdf/pdf_parse_union_core_v2_llm.py · 26 symbols
magic_pdf/model/magic_model.py · 24 symbols
magic_pdf/config/exceptions.py · 15 symbols
demo/demo_gradio.py · 15 symbols
magic_pdf/operators/pipes_llm.py · 13 symbols
magic_pdf/filter/pdf_meta_scan.py · 13 symbols
tools/lmdeploy_patcher.py · 12 symbols
magic_pdf/model/sub_modules/reading_oreder/layoutreader/helpers.py · 10 symbols

Dependencies from manifests, versioned

Brotli 1.1.0 · 1×
PyMuPDF 1.24.9 · 1×
boto3 1.28.43 · 1×
click 8.1.7 · 1×
dill 0.3.8 · 1×
doclayout_yolo 0.0.2b1 · 1×
fast-langdetect 0.2.3 · 1×
fastapi 0.104.1 · 1×
gradio 5.23.3 · 1×
loguru 0.6.0 · 1×
numpy 1.21.6 · 1×
openai 2.6.1 · 1×

For agents

$ claude mcp add MonkeyOCR \
  -- python -m otcore.mcp_server <graph>
