github.com/clovaai/donut @1.0.9 sqlite

97 symbols 346 edges 19 files 26 documented · 27%

Donut 🍩 : Document Understanding Transformer

Official Implementation of Donut and SynthDoG | Paper | Slide | Poster

Introduction

Donut 🍩, the Document Understanding Transformer, is a new method for document understanding that uses an OCR-free, end-to-end Transformer model. Donut does not require off-the-shelf OCR engines or APIs, yet it shows state-of-the-art performance on various visual document understanding tasks, such as visual document classification and information extraction (a.k.a. document parsing). In addition, we present SynthDoG 🐶, the Synthetic Document Generator, which helps make model pre-training flexible across various languages and domains.

Our academic paper, which describes our method in detail and provides full experimental results and analyses, can be found here:

OCR-free Document Understanding Transformer.

Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park. In ECCV 2022.

Pre-trained Models and Web Demos

Gradio web demos are available!

- You can run the demo with the ./app.py file.
- Sample images are available at ./misc, and more receipt images are available at the CORD dataset link.
- Web demos are available from the links in the following table.

| Task | Sec/Img | Score | Trained Model | Demo |
|---|---|---|---|---|
| CORD (Document Parsing) | 0.7 / 0.7 / 1.2 | 91.3 / 91.1 / 90.9 | donut-base-finetuned-cord-v2 (1280) / donut-base-finetuned-cord-v1 (1280) / donut-base-finetuned-cord-v1-2560 | gradio space web demo, google colab demo |

The links to the pre-trained backbones are here:

- donut-base: trained with 64 A100 GPUs (~2.5 days), number of layers (encoder: {2,2,14,2}, decoder: 4), input size 2560x1920, swin window size 10, IIT-CDIP (11M) and SynthDoG (English, Chinese, Japanese, Korean, 0.5M x 4).
- donut-proto: (preliminary model) trained with 8 V100 GPUs (~5 days), number of layers (encoder: {2,2,18,2}, decoder: 4), input size 2048x1536, swin window size 8, and SynthDoG (English, Japanese, Korean, 0.4M x 3).

Please see our paper for more details.

SynthDoG datasets

The links to the SynthDoG-generated datasets are here:

synthdog-en: English, 0.5M.
synthdog-zh: Chinese, 0.5M.
synthdog-ja: Japanese, 0.5M.
synthdog-ko: Korean, 0.5M.

To generate synthetic datasets with our SynthDoG, please see ./synthdog/README.md and our paper for details.

Updates

2022-11-14 New version 1.0.9 is released (pip install donut-python --upgrade). See 1.0.9 Release Notes.

2022-08-12 Donut 🍩 is also available at huggingface/transformers 🤗 (contributed by @NielsRogge). donut-python loads the pre-trained weights from the official branch of the model repositories. See 1.0.5 Release Notes.

2022-08-05 A well-executed hands-on tutorial on donut 🍩 is published at Towards Data Science (written by @estaudere).

2022-07-20 First commit. We release our code, model weights, synthetic data, and generator.

Software installation

pip install donut-python

or clone this repository and install the dependencies:

git clone https://github.com/clovaai/donut.git
cd donut/
conda create -n donut_official python=3.7
conda activate donut_official
pip install .

We tested donut with:

- torch == 1.11.0+cu113
- torchvision == 0.12.0+cu113
- pytorch-lightning == 1.6.4
- transformers == 4.11.3
- timm == 0.5.4

Getting Started

Data

This repository assumes the following dataset structure:

> tree dataset_name
dataset_name
├── test
│   ├── metadata.jsonl
│   ├── {image_path0}
│   ├── {image_path1}
│             .
│             .
├── train
│   ├── metadata.jsonl
│   ├── {image_path0}
│   ├── {image_path1}
│             .
│             .
└── validation
    ├── metadata.jsonl
    ├── {image_path0}
    ├── {image_path1}
              .
              .

> cat dataset_name/test/metadata.jsonl
{"file_name": {image_path0}, "ground_truth": "{\"gt_parse\": {ground_truth_parse}, ... {other_metadata_not_used} ... }"}
{"file_name": {image_path1}, "ground_truth": "{\"gt_parse\": {ground_truth_parse}, ... {other_metadata_not_used} ... }"}
     .
     .

The metadata.jsonl file is in JSON Lines text format, i.e., .jsonl. Each line consists of:

- file_name: relative path to the image file.
- ground_truth: a JSON-dumped string. The dictionary contains either gt_parse or gt_parses. Other fields (metadata) can be added to the dictionary but will not be used.

donut interprets all tasks as a JSON prediction problem. As a result, all donut model training shares the same pipeline. For training and inference, the only thing to do is to prepare gt_parse or gt_parses for the task in the format described below.
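
As an illustration of the record layout described above, the following minimal sketch writes a metadata.jsonl file and reads it back. The helper names (`write_metadata`, `read_metadata`) and the sample file names are hypothetical and not part of the donut package. The only assumption taken from the text is that `ground_truth` holds a JSON-dumped dictionary containing `gt_parse` or `gt_parses`.

```python
import json
import os
import tempfile


def write_metadata(split_dir, samples):
    """Write one JSON line per (file_name, ground-truth dict) pair."""
    path = os.path.join(split_dir, "metadata.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for file_name, ground_truth in samples:
            if "gt_parse" not in ground_truth and "gt_parses" not in ground_truth:
                raise ValueError(f"{file_name}: missing gt_parse or gt_parses")
            # ground_truth is stored as a JSON string inside the outer JSON object.
            record = {"file_name": file_name, "ground_truth": json.dumps(ground_truth)}
            f.write(json.dumps(record) + "\n")
    return path


def read_metadata(path):
    """Yield (file_name, decoded ground-truth dict) for every non-empty line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                record = json.loads(line)
                yield record["file_name"], json.loads(record["ground_truth"])


# Hypothetical samples: the file names and labels are placeholders.
samples = [
    ("image_0.png", {"gt_parse": {"class": "scientific_report"}}),
    ("image_1.png", {"gt_parse": {"class": "presentation"}}),
]
with tempfile.TemporaryDirectory() as split_dir:
    metadata_path = write_metadata(split_dir, samples)
    loaded = list(read_metadata(metadata_path))
print(loaded[0])
```

Note that `ground_truth` is encoded twice: once as its own JSON string and once as part of the outer record, which is why reading it back takes two `json.loads` calls.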

For Document Classification

The gt_parse follows the format of {"class" : {class_name}}, for example, {"class" : "scientific_report"} or {"class" : "presentation"}.

- Google colab demo is available here.
- Gradio web demo is available here.

For Document Information Extraction

The gt_parse is a JSON object that contains the full information of the document image. For example, the JSON object for a receipt may look like {"menu" : [{"nm": "ICE BLACKCOFFEE", "cnt": "2", ...}, ...], ...}.

- More examples are available at CORD dataset.
- Google colab demo is available here.
- Gradio web demo is available here.
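
To make the "JSON prediction" framing more concrete, here is a minimal sketch of one way a nested gt_parse such as the receipt above could be flattened into a single target string for a sequence decoder. The function name `json_to_sequence` and the tag scheme (`<s_key>`, `</s_key>`, `<sep/>`) are illustrative assumptions, not confirmed by this README. See donut/model.py for the conversion the repository actually uses.

```python
def json_to_sequence(obj):
    """Flatten a nested JSON-like object into a tagged string (illustrative only)."""
    if isinstance(obj, dict):
        # Wrap every value in an opening and closing tag named after its key.
        return "".join(
            f"<s_{key}>{json_to_sequence(value)}</s_{key}>"
            for key, value in obj.items()
        )
    if isinstance(obj, list):
        # Separate list items with an assumed separator tag.
        return "<sep/>".join(json_to_sequence(item) for item in obj)
    # Leaf values are emitted as plain text.
    return str(obj)


receipt = {"menu": [{"nm": "ICE BLACKCOFFEE", "cnt": "2"}]}
print(json_to_sequence(receipt))
# <s_menu><s_nm>ICE BLACKCOFFEE</s_nm><s_cnt>2</s_cnt></s_menu>
```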

For Document Visual Question Answering

The gt_parses follows the format of [{"question" : {question_sentence}, "answer" : {answer_candidate_1}}, {"question" : {question_sentence}, "answer" : {answer_candidate_2}}, ...], for example, [{"question" : "what is the model name?", "answer" : "donut"}, {"question" : "what is the model name?", "answer" : "document understanding transformer"}].

- DocVQA Task1 has multiple answers, hence gt_parses should be a list of dictionaries, each containing a question-answer pair.
- Google colab demo is available here.
- Gradio web demo is available here.
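
The expansion of one question with several accepted answers into gt_parses can be sketched as follows. The helper name `build_vqa_ground_truth` is hypothetical and not part of the donut package. It only assembles the structure described above and dumps it to the string expected in the ground_truth field.

```python
import json


def build_vqa_ground_truth(question, answers):
    """Return a JSON-dumped ground_truth with one question/answer pair per candidate."""
    if not answers:
        raise ValueError("at least one answer candidate is required")
    # Repeat the question once for every accepted answer.
    gt_parses = [{"question": question, "answer": answer} for answer in answers]
    return json.dumps({"gt_parses": gt_parses})


ground_truth = build_vqa_ground_truth(
    "what is the model name?",
    ["donut", "document understanding transformer"],
)
print(ground_truth)
```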

For (Pseudo) Text Reading Task

The gt_parse looks like {"text_sequence" : "word1 word2 word3 ... "}.

- This task is also a pre-training task of the Donut model.
- You can use our SynthDoG 🐶 to generate synthetic images for the text reading task with proper gt_parse. See ./synthdog/README.md for details.

Training

This is the configuration used in our experiment to train the Donut model on the CORD dataset. We ran it on a single NVIDIA A100 GPU.

```bash
python train.py --config config/train_cord.yaml \
                --pretrained_model_name_or_path "naver-clova-ix/donut-base" \
                --dataset_name_or_paths '["naver-clova-ix/cord-v2"]' \
                --exp_version "test_experiment"
  .
  .
Prediction: Lemon Tea (L)125.00025.00030.0005.000
Answer: Lemon Tea (L)125.00025.00030.0005.000
Normed ED: 0.0
Prediction: Hulk Topper Package1100.000100.000100.0000
Answer: Hulk Topper Package1100.000100.000100.0000
Normed ED: 0.0
Prediction: Giant Squidx 1Rp. 39.000C.Finishing - CutRp. 0Rp. 39.000Rp. 39.000Rp. 50.000Rp. 11.000
Answer: Giant Squidx1Rp. 39.000C.Finishing - CutRp. 0
```

Core symbols most depended-on inside this repo

get

called by 53

synthdog/elements/content.py

construct_tree_from_dict

Shape

Method 70

Class 18

Function 9

Languages

Python 100%

Modules by API surface

donut/model.py: 19 symbols
lightning_module.py: 15 symbols
donut/util.py: 15 symbols
synthdog/elements/content.py: 12 symbols
synthdog/template.py: 7 symbols
train.py: 6 symbols
synthdog/layouts/grid_stack.py: 3 symbols
synthdog/layouts/grid.py: 3 symbols
synthdog/elements/textbox.py: 3 symbols
synthdog/elements/paper.py: 3 symbols
synthdog/elements/document.py: 3 symbols
synthdog/elements/background.py: 3 symbols
