Official Implementation of Donut and SynthDoG | Paper | Slide | Poster
Donut 🍩 (Document understanding transformer) is a new method of document understanding that uses an OCR-free, end-to-end Transformer model. Donut does not require off-the-shelf OCR engines or APIs, yet it shows state-of-the-art performance on various visual document understanding tasks, such as visual document classification and information extraction (a.k.a. document parsing). In addition, we present SynthDoG 🐶 (Synthetic Document Generator), which helps make model pre-training flexible across various languages and domains.
Our academic paper, which describes our method in detail and provides full experimental results and analyses, can be found here:
OCR-free Document Understanding Transformer.
Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park. In ECCV 2022.

Gradio web demos are available!
- You can run the demo with the ./app.py file.
- Sample images are available at ./misc, and more receipt images are available at the CORD dataset link.
- Web demos are available from the links in the following table.
|Task|Sec/Img|Score|Trained Model|Demo|
|---|---|---|---|---|
| CORD (Document Parsing) | 0.7 / 0.7 / 1.2 | 91.3 / 91.1 / 90.9 | donut-base-finetuned-cord-v2 (1280) / donut-base-finetuned-cord-v1 (1280) / donut-base-finetuned-cord-v1-2560 | gradio space web demo, google colab demo |
| Train Ticket (Document Parsing) | 0.6 | 98.7 | donut-base-finetuned-zhtrainticket | google colab demo |
| RVL-CDIP (Document Classification) | 0.75 | 95.3 | donut-base-finetuned-rvlcdip | gradio space web demo, google colab demo |
| DocVQA Task1 (Document VQA) | 0.78 | 67.5 | donut-base-finetuned-docvqa | gradio space web demo |

The links to the pre-trained backbones are here:
- donut-base: trained with 64 A100 GPUs (~2.5 days), number of layers (encoder: {2,2,14,2}, decoder: 4), input size 2560x1920, swin window size 10, IIT-CDIP (11M) and SynthDoG (English, Chinese, Japanese, Korean, 0.5M x 4).
- donut-proto: (preliminary model) trained with 8 V100 GPUs (~5 days), number of layers (encoder: {2,2,18,2}, decoder: 4), input size 2048x1536, swin window size 8, and SynthDoG (English, Japanese, Korean, 0.4M x 3).
Please see our paper for more details.

The links to the SynthDoG-generated datasets are here:
- synthdog-en: English, 0.5M.
- synthdog-zh: Chinese, 0.5M.
- synthdog-ja: Japanese, 0.5M.
- synthdog-ko: Korean, 0.5M.

To generate synthetic datasets with our SynthDoG, please see ./synthdog/README.md and our paper for details.

2022-11-14 New version 1.0.9 is released (pip install donut-python --upgrade). See 1.0.9 Release Notes.
2022-08-12 Donut 🍩 is also available at huggingface/transformers 🤗 (contributed by @NielsRogge). donut-python loads the pre-trained weights from the official branch of the model repositories. See 1.0.5 Release Notes.
2022-08-05 A well-executed hands-on tutorial on donut 🍩 is published at Towards Data Science (written by @estaudere).
2022-07-20 First commit. We release our code, model weights, synthetic data, and generator.
pip install donut-python
or clone this repository and install the dependencies:
git clone https://github.com/clovaai/donut.git
cd donut/
conda create -n donut_official python=3.7
conda activate donut_official
pip install .
We tested donut with:
- torch == 1.11.0+cu113
- torchvision == 0.12.0+cu113
- pytorch-lightning == 1.6.4
- transformers == 4.11.3
- timm == 0.5.4

This repository assumes the following structure of dataset:
> tree dataset_name
dataset_name
├── test
│ ├── metadata.jsonl
│ ├── {image_path0}
│ ├── {image_path1}
│ .
│ .
├── train
│ ├── metadata.jsonl
│ ├── {image_path0}
│ ├── {image_path1}
│ .
│ .
└── validation
  ├── metadata.jsonl
  ├── {image_path0}
  ├── {image_path1}
  .
  .
> cat dataset_name/test/metadata.jsonl
{"file_name": {image_path0}, "ground_truth": "{\"gt_parse\": {ground_truth_parse}, ... {other_metadata_not_used} ... }"}
{"file_name": {image_path1}, "ground_truth": "{\"gt_parse\": {ground_truth_parse}, ... {other_metadata_not_used} ... }"}
.
.
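
As a quick sanity check before training, the directory layout shown above can be verified with a few lines of Python. This is a minimal sketch, not part of the donut package: the helper name `missing_metadata` is hypothetical, and it only assumes the three split folders (`train`, `validation`, `test`) each hold a `metadata.jsonl` file, as in the tree above.

```python
from pathlib import Path

# Split folder names taken from the directory tree above.
SPLITS = ("train", "validation", "test")


def missing_metadata(dataset_root):
    """Return the splits under dataset_root that lack a metadata.jsonl file.

    Hypothetical helper for illustration; it does not inspect the images.
    """
    root = Path(dataset_root)
    return [
        split
        for split in SPLITS
        if not (root / split / "metadata.jsonl").is_file()
    ]


# Example usage (the path is a placeholder):
# print(missing_metadata("dataset_name"))
```

An empty list means every split has its metadata file in the expected place.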
The metadata.jsonl file is in JSON Lines text format, i.e., .jsonl. Each line consists of:
- file_name : relative path to the image file.
- ground_truth : string format (json dumped); the dictionary contains either gt_parse or gt_parses. Other fields (metadata) can be added to the dictionary but will not be used.

Donut interprets all tasks as a JSON prediction problem. As a result, all Donut model training shares the same pipeline. For training and inference, the only thing to do is to prepare gt_parse or gt_parses for the task in the format described below.

For document classification, the gt_parse follows the format of {"class" : {class_name}}, for example, {"class" : "scientific_report"} or {"class" : "presentation"}.
- Google colab demo is available here.
- Gradio web demo is available here.
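
The classification format above can be sketched as a small writer for one metadata.jsonl line. This is an illustrative sketch only: `make_metadata_line` and the file name `image_0.png` are hypothetical and not part of donut. The point it shows is that ground_truth is a JSON string nested inside the outer JSON line, so it is dumped twice.

```python
import json


def make_metadata_line(file_name, gt_parse):
    """Build one metadata.jsonl line (hypothetical helper, for illustration)."""
    # ground_truth is stored as a string, so the inner dict is dumped first...
    ground_truth = json.dumps({"gt_parse": gt_parse})
    # ...and then embedded as a plain string value in the outer JSON object.
    return json.dumps({"file_name": file_name, "ground_truth": ground_truth})


# "image_0.png" is a placeholder path relative to the split folder.
line = make_metadata_line("image_0.png", {"class": "scientific_report"})
print(line)
```

Appending one such line per image to `metadata.jsonl` would give a file in the layout described earlier.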
For document information extraction, the gt_parse is a JSON object that contains the full information of the document image. For example, the JSON object for a receipt may look like {"menu" : [{"nm": "ICE BLACKCOFFEE", "cnt": "2", ...}, ...], ...}.
- More examples are available at CORD dataset.
- Google colab demo is available here.
- Gradio web demo is available here.
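
Reading such an entry back takes two decoding steps, because ground_truth is a JSON string inside a JSON line. The sketch below assumes a trimmed-down receipt with only the `nm` and `cnt` fields from the example above; `read_gt_parse` and the file name `receipt_0.png` are hypothetical names used for illustration.

```python
import json


def read_gt_parse(metadata_line):
    """Return (file_name, gt_parse) from one metadata.jsonl line."""
    record = json.loads(metadata_line)                 # outer JSON line
    ground_truth = json.loads(record["ground_truth"])  # inner JSON string
    return record["file_name"], ground_truth["gt_parse"]


# A hand-made sample line; real CORD entries contain more fields.
sample_line = json.dumps({
    "file_name": "receipt_0.png",
    "ground_truth": json.dumps(
        {"gt_parse": {"menu": [{"nm": "ICE BLACKCOFFEE", "cnt": "2"}]}}
    ),
})

file_name, gt_parse = read_gt_parse(sample_line)
for item in gt_parse["menu"]:
    print(item["nm"], item["cnt"])
```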
For document visual question answering, the gt_parses follows the format of [{"question" : {question_sentence}, "answer" : {answer_candidate_1}}, {"question" : {question_sentence}, "answer" : {answer_candidate_2}}, ...], for example, [{"question" : "what is the model name?", "answer" : "donut"}, {"question" : "what is the model name?", "answer" : "document understanding transformer"}].
- DocVQA Task1 has multiple answers, hence gt_parses should be a list of dictionaries, each containing a question-answer pair.
- Google colab demo is available here.
- Gradio web demo is available here.
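
Because one question can have several acceptable answers, the gt_parses list repeats the question once per answer candidate. A minimal sketch of that expansion, using a hypothetical helper name `make_docvqa_ground_truth`, might look like this:

```python
import json


def make_docvqa_ground_truth(question, answers):
    """Build the json-dumped ground_truth string for one DocVQA sample.

    Hypothetical helper: emits one {"question", "answer"} pair per candidate.
    """
    gt_parses = [{"question": question, "answer": answer} for answer in answers]
    return json.dumps({"gt_parses": gt_parses})


ground_truth = make_docvqa_ground_truth(
    "what is the model name?",
    ["donut", "document understanding transformer"],
)
print(ground_truth)
```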
For text reading, the gt_parse looks like {"text_sequence" : "word1 word2 word3 ... "}.
- This task is also a pre-training task of the Donut model.
- You can use our SynthDoG 🐶 to generate synthetic images for the text reading task with proper gt_parse. See ./synthdog/README.md for details.
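
Since every task shares the same metadata layout, a single check can cover all of the formats above. The sketch below is an assumption-laden illustration rather than donut's own validation: the function name `ground_truth_key` is hypothetical, and it reads "either gt_parse or gt_parses" as meaning exactly one of the two keys is present.

```python
import json


def ground_truth_key(metadata_line):
    """Return "gt_parse" or "gt_parses" for a valid line, else raise ValueError.

    Assumption: exactly one of the two keys should be present.
    """
    record = json.loads(metadata_line)
    if "file_name" not in record or "ground_truth" not in record:
        raise ValueError("line needs both file_name and ground_truth")
    ground_truth = json.loads(record["ground_truth"])
    present = [key for key in ("gt_parse", "gt_parses") if key in ground_truth]
    if len(present) != 1:
        raise ValueError("ground_truth needs exactly one of gt_parse / gt_parses")
    return present[0]


# A text reading sample; "page_0.png" is a placeholder file name.
text_line = json.dumps({
    "file_name": "page_0.png",
    "ground_truth": json.dumps({"gt_parse": {"text_sequence": "word1 word2 word3"}}),
})
print(ground_truth_key(text_line))
```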
This is the configuration used in our experiment to train the Donut model on the CORD dataset. We ran this with a single NVIDIA A100 GPU.
```bash
python train.py --config config/train_cord.yaml \
--pretrained_model_name_or_path "naver-clova-ix/donut-base" \
--dataset_name_or_paths '["naver-clova-ix/cord-v2"]' \
--exp_version "test_experiment"
.
.
Prediction: Lemon Tea (L)125.00025.00030.0005.000
Answer: Lemon Tea (L)125.00025.00030.0005.000
Normed ED: 0.0
Prediction: Hulk Topper Package1100.000100.000100.0000
Answer: Hulk Topper Package1100.000100.000100.0000
Normed ED: 0.0
Prediction: Giant Squidx 1Rp. 39.000C.Finishing - CutRp. 0Rp. 39.000Rp. 39.000Rp. 50.000Rp. 11.000
Answer: Giant Squidx1Rp. 39.000C.Finishing - CutRp. 0