github.com/datalab-to/surya @v0.20.0 sqlite

405 symbols 1,452 edges 69 files 87 documented · 21%
README

Datalab

State of the Art models for Document Intelligence


Surya

Surya is a 650M param OCR model with these features:

  • Accuracy - scores 83.3% on olmOCR-bench (top under 3B params)
  • Speed - throughput of 5 pages/s on an RTX 5090
  • Multilingual - scores 87.2% on an internal benchmark set of 91 languages (more here)
  • Layout analysis (table, image, header, etc.) with reading order
  • Table recognition (rows + columns)

We also ship smaller models for line-level text detection and OCR error detection. Surya works on a range of documents (see usage and benchmarks).

Try Datalab's Managed Platform

Our managed platform runs both Surya and variants of our highest-accuracy model, Chandra.

Get started with $5 in free credits: sign up (takes under 30 seconds) or try our free public playground.

Model Information

Detection OCR
Layout Table Recognition

Surya is named for the Hindu sun god, who has universal vision.

Examples

Each row links to five annotated views of the same page: text-line detection, OCR, layout, reading order, and (when present) table recognition.

Name               Detection  OCR    Layout  Order  Table Rec
Newspaper          Image      Image  Image   Image  -
Textbook           Image      Image  Image   Image  -
Tax Form           Image      Image  Image   Image  Image
Handwritten Notes  Image      Image  Image   Image  Image
Corporate Doc      Image      Image  Image   Image  Image

Commercial usage

The Surya code is licensed under Apache 2.0. The model weights use a modified AI Pubs Open Rail-M license (free for research, personal use, and startups under $5M funding/revenue). For broader commercial licensing of the model weights, visit our pricing page here.

Installation

Install with:

pip install surya-ocr

Inference backend prerequisites

Surya auto-spawns an inference server on first use, which requires either vllm (NVIDIA GPU) or llama.cpp (CPU / Apple Silicon):

  • NVIDIA GPU: Docker plus the NVIDIA Container Toolkit.
  • CPU / Apple Silicon: the llama-server binary from llama.cpp:

brew install llama.cpp   # macOS
# or grab a release from https://github.com/ggml-org/llama.cpp/releases

Upgrading from Surya v1

If you have v1 code, migrate it to the v2 pattern:

# v2
from surya.inference import SuryaInferenceManager
from surya.recognition import RecognitionPredictor

manager = SuryaInferenceManager()              # auto-spawns vllm or llama-server
rec = RecognitionPredictor(manager)
predictions = rec([image])

What's different:

  • SuryaInferenceManager replaces FoundationPredictor. The same manager instance is shared across LayoutPredictor, RecognitionPredictor, and TableRecPredictor.
  • Output schemas changed: see the per-section JSON tables below. Highlights: text_lines → blocks (with html); layout dropped top_k and added count; table_rec dropped is_header / colspan / rowspan from cells.

Usage

Surya 2 runs layout, OCR, and table recognition through a single VLM. The inference manager will spawn one for you on first use; you can also point it at an existing server via SURYA_INFERENCE_URL=http://host:port/v1.

  • Inspect the settings in surya/settings.py. You can override any setting via env var (e.g. SURYA_INFERENCE_BACKEND=vllm).
  • Text detection and OCR error detection are separate models.
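
The environment overrides above can also be set from Python. This is a minimal sketch, assuming Surya reads its settings from the environment when it is first imported (not confirmed here). `surya_env` is a hypothetical helper, and the host and port are placeholders; only the two variable names come from this README.

```python
import os
from typing import Dict, Optional
from urllib.parse import urlparse


def surya_env(host: str, port: int, backend: Optional[str] = None) -> Dict[str, str]:
    """Build SURYA_* overrides that point the client at an existing server.

    Hypothetical helper: only the variable names come from the README.
    """
    url = f"http://{host}:{port}/v1"
    parsed = urlparse(url)
    # Reject obviously unusable values such as an empty host.
    if not parsed.hostname or parsed.port is None:
        raise ValueError(f"not a usable server URL: {url}")
    env = {"SURYA_INFERENCE_URL": url}
    if backend is not None:
        env["SURYA_INFERENCE_BACKEND"] = backend  # e.g. "vllm"
    return env


# Apply the overrides to this process (assumption: do this before importing surya).
os.environ.update(surya_env("localhost", 8000, backend="vllm"))
```

Exporting the same variables in the shell before running a surya_* command should have the same effect.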

Server lifecycle (--keep_server)

By default each command spawns the VLM server on startup and shuts it down on exit — so running several commands in a row pays the startup (and, on GPU, the model-load) cost every time. Pass --keep_server to leave the server running so later commands attach to it instead of re-spawning:

surya_ocr    DATA_PATH --keep_server   # spawns the server and leaves it up
surya_layout DATA_PATH                 # attaches to the running server
surya_table  DATA_PATH                 # ...and so on, no re-spawn

--keep_server works on every command. Stop the server when you're done (docker stop the surya-vllm-* container, or kill the llama-server process), or set SURYA_INFERENCE_KEEP_ALIVE=1 to make keep-alive the default.
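
To run the same sequence from a script, the commands can be launched with keep-alive set through the environment. This is a sketch under assumptions: `run_with_shared_server` is a hypothetical helper, the three commands are assumed to be on PATH, and `dry_run=True` only returns the command lines without executing anything.

```python
import os
import subprocess
from typing import List


def run_with_shared_server(data_path: str, dry_run: bool = False) -> List[List[str]]:
    """Run OCR, layout, and table recognition against one shared server."""
    # SURYA_INFERENCE_KEEP_ALIVE=1 makes keep-alive the default for every command.
    env = {**os.environ, "SURYA_INFERENCE_KEEP_ALIVE": "1"}
    commands = [[tool, data_path] for tool in ("surya_ocr", "surya_layout", "surya_table")]
    if not dry_run:
        for cmd in commands:
            # check=True stops the sequence at the first failing command.
            subprocess.run(cmd, env=env, check=True)
    return commands


print(run_with_shared_server("doc.pdf", dry_run=True))
```

The server is still running when the function returns, so it has to be stopped as described above.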

Interactive App

I've included a streamlit app that lets you interactively try Surya on images or PDF files. Run it with:

pip install streamlit pdftext
surya_gui

OCR (text recognition)

This command will write out a json file with the detected text and bboxes:

surya_ocr DATA_PATH
  • DATA_PATH can be an image, pdf, or folder of images/pdfs
  • --images will save images of the pages and detected blocks (optional)
  • --output_dir specifies the directory to save results to instead of the default
  • --page_range specifies the page range to process in the PDF, specified as a single number, a comma separated list, a range, or comma separated ranges - example: 0,5-10,20.
  • --keep_server leaves the inference server running after the command exits so later commands reuse it (see Server lifecycle). Available on every command.
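
The --page_range format can be illustrated with a small parser. This is a sketch, not Surya's own implementation: `parse_page_range` is a hypothetical name, and it assumes that ranges are inclusive and that pages are 0-indexed, as the example 0,5-10,20 suggests.

```python
from typing import List


def parse_page_range(spec: str) -> List[int]:
    """Expand a spec such as "0,5-10,20" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if end < start:
                raise ValueError(f"descending range: {part}")
            pages.update(range(start, end + 1))  # assumption: end is inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_range("0,5-10,20"))  # [0, 5, 6, 7, 8, 9, 10, 20]
```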

The results.json file contains a dict keyed by input filename (no extension). Each value is a list of page dicts. Each page dict contains:

  • blocks - per-block OCR results in reading order
      • label - canonicalized layout label (e.g. Text, SectionHeader, Table, Equation, Picture, Form, PageHeader, ...). See surya/layout/label.py:LAYOUT_PRED_RELABEL for the full canonical-name set.
      • raw_label - original label emitted by the model, before canonicalization
      • reading_order - 0-indexed position in layout output
      • html - block content as HTML (math wrapped in <math>...</math>, tables as <table>...</table>, etc.). "" if the block was skipped
      • polygon - 4-corner polygon in [[x0,y0],[x1,y0],[x1,y1],[x0,y1]] order
      • bbox - axis-aligned [x0, y0, x1, y1] derived from the polygon
      • confidence - mean per-token probability across the block's decode (0-1)
      • skipped - true if the block was a visual label (e.g. Picture) and not OCR'd
      • error - true if the block OCR call failed
  • image_bbox - [0, 0, width, height] for the page image
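
A short sketch of reading this structure: it keeps the blocks that were actually OCR'd and orders them by reading_order. The field names follow the list above; `page_text_blocks` and `load_results` are hypothetical helpers, and the `sample` dict is made-up data shaped like results.json, not real model output.

```python
import json
from typing import Dict, List


def load_results(path: str) -> Dict[str, List[dict]]:
    """Load a results.json file (filename -> list of page dicts)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def page_text_blocks(page: dict) -> List[dict]:
    """Return OCR'd blocks in reading order, dropping skipped and failed ones."""
    usable = [b for b in page["blocks"] if not b.get("skipped") and not b.get("error")]
    return sorted(usable, key=lambda b: b["reading_order"])


# Made-up sample with the documented shape (illustrative values only).
sample = {
    "invoice": [
        {
            "image_bbox": [0, 0, 1000, 1400],
            "blocks": [
                {"label": "Text", "reading_order": 1, "html": "<p>Total due: 42</p>",
                 "skipped": False, "error": False},
                {"label": "Picture", "reading_order": 2, "html": "",
                 "skipped": True, "error": False},
                {"label": "SectionHeader", "reading_order": 0, "html": "<h1>Invoice</h1>",
                 "skipped": False, "error": False},
            ],
        }
    ]
}

for name, pages in sample.items():
    for page_number, page in enumerate(pages):
        for block in page_text_blocks(page):
            print(name, page_number, block["label"], block["html"])
```

For a real run, `load_results` would be pointed at the results.json file in the output directory.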

Performance tips

  • Throughput is governed by the inference backend. With vllm, raise --max-num-seqs / --max-num-batched-tokens (or SURYA_INFERENCE_PARALLEL on the client side) to keep more pages in flight. With llama.cpp, set SURYA_INFERENCE_PARALLEL to match --parallel on llama-server.
  • DPI can also impact throughput significantly. Adjust the DPI settings to make the right throughput/accuracy tradeoff for your use case. Try going from 192 to 96 for improved throughput.
  • MTP can also impact latency/throughput - you can adjust the vllm mtp config in settings.
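
A rough way to see why the DPI tip matters: halving the DPI quarters the number of pixels per page. The sketch below works this out for a US Letter page (8.5 x 11 inches, chosen only as an example). Pixel count is a proxy; the actual speedup depends on how the backend turns images into tokens, which is not covered here.

```python
def page_pixels(width_in: float, height_in: float, dpi: int) -> int:
    """Pixel count of a page rendered at the given DPI."""
    return round(width_in * dpi) * round(height_in * dpi)


letter = (8.5, 11.0)  # US Letter, in inches (example page size)
high = page_pixels(*letter, dpi=192)  # 1632 x 2112 = 3446784
low = page_pixels(*letter, dpi=96)    # 816 x 1056 = 861696

print(high, low, high / low)  # 3446784 861696 4.0
```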

From python

from PIL import Image
from surya.inference import SuryaInferenceManager
from surya.recognition import RecognitionPredictor

manager = SuryaInferenceManager()
recognition_predictor = RecognitionPredictor(manager)

# Default: full-page OCR. One VLM call per page. Returns one PageOCRResult per
# image: `.blocks` (each with label, html, polygon, bbox, confidence, ...) and
# `.image_bbox` — the same schema as block mode.
predictions = recognition_predictor([Image.open(IMAGE_PATH)])

# Block mode: pre-run layout, then per-block OCR. Same return schema as above.
# Auto-selected when `layout_results` is passed.
from surya.layout import LayoutPredictor
layout = LayoutPredictor(manager)
layouts = layout([Image.open(IMAGE_PATH)])
predictions = recognition_predictor([Image.open(IMAGE_PATH)], layouts)

Text line detection

This command will write out a json file with the detected bboxes.

surya_detect DATA_PATH
  • DATA_PATH can be an image, pdf, or folder of images/pdfs
  • --images will save images of the pages and detected text lines (optional)
  • --output_dir specifies the directory to save results to instead of the default
  • --page_range specifies the page range to process in the PDF, specified as a single number, a comma separated list, a range, or comma separated ranges - example: 0,5-10,20.

The results.json file will contain a json dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one per page of the input document. Each page dictionary contains:

  • bboxes - detected bounding boxes for text
      • bbox - the axis-aligned rectangle for the text line in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
      • polygon - the polygon for the text line in (x1, y1), (x2, y2), (x3, y3), (x4, y4) format. The points are in clockwise order from the top left.
      • confidence - the confidence of the model in the detected text (0-1)
  • vertical_lines - vertical lines detected in the document
      • bbox - the axis-aligned line coordinates.
  • page - the page number in the file
  • image_bbox - the bbox for the image in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner. All line bboxes will be contained within this bbox.
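
The relationship between polygon, bbox, and image_bbox can be sketched as follows. Both helpers are hypothetical and the coordinates are made up; the sketch only shows how an axis-aligned box is derived from four corner points and how the containment rule stated above could be checked.

```python
from typing import List, Sequence


def polygon_to_bbox(polygon: Sequence[Sequence[float]]) -> List[float]:
    """Smallest axis-aligned [x1, y1, x2, y2] box that encloses the polygon."""
    xs = [point[0] for point in polygon]
    ys = [point[1] for point in polygon]
    return [min(xs), min(ys), max(xs), max(ys)]


def is_inside(bbox: Sequence[float], image_bbox: Sequence[float]) -> bool:
    """True if bbox lies entirely within image_bbox."""
    return (
        bbox[0] >= image_bbox[0]
        and bbox[1] >= image_bbox[1]
        and bbox[2] <= image_bbox[2]
        and bbox[3] <= image_bbox[3]
    )


# A slightly skewed text line, clockwise from the top left (made-up values).
polygon = [[10, 20], [110, 22], [109, 42], [9, 40]]
bbox = polygon_to_bbox(polygon)
print(bbox)                                   # [9, 20, 110, 42]
print(is_inside(bbox, [0, 0, 1000, 1400]))    # True
```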

Performance tips

Detection is a torch model. DETECTOR_BATCH_SIZE defaults to an auto-picked value at runtime; override the env var to control VRAM usage on GPU and raise it on larger cards.

From python

from PIL import Image
from surya.detection import DetectionPredictor

det_predictor = DetectionPredictor()
predictions = det_predictor([Image.open(IMAGE_PATH)])

Layout and reading order

This command will write out a json file with the detected layout and reading order.

surya_layout DATA_PATH
  • DATA_PATH can be an image, pdf, or folder of images/pdfs
  • --images will save images of the pages and detected text lines (optional)
  • --output_dir specifies the directory to save results to instead of the default
  • --page_range specifies the page range to process in the PDF, specified as a single number, a comma separated list, a range, or comma separated ranges - example: 0,5-10,20.

The results.json file contains a dict keyed by input filename (no extension). Each value is a list of page dicts. Each page dict contains:

  • bboxes - layout boxes in

Core symbols most depended-on inside this repo

get_logger · called by 19 · surya/logging.py
val2tuple · called by 15 · surya/detection/model/encoderdecoder.py
to · called by 12 · surya/common/predictor.py
_show_timing · called by 9 · surya/scripts/streamlit_app.py
reshape · called by 8 · surya/ocr_error/model/encoder.py
from_pretrained · called by 6 · surya/common/s3.py
configure_logging · called by 5 · surya/logging.py
get_default_manager · called by 5 · surya/inference/__init__.py

Shape

Method 178
Function 141
Class 73
Route 13

Languages

Python 100%

Modules by API surface

surya/detection/model/encoderdecoder.py · 53 symbols
surya/ocr_error/model/encoder.py · 42 symbols
surya/ocr_error/tokenizer.py · 26 symbols
surya/scripts/screenshot_app.py · 18 symbols
surya/inference/backends/spawn.py · 18 symbols
surya/common/polygon.py · 16 symbols
surya/scripts/streamlit_app.py · 15 symbols
surya/inference/parsers.py · 11 symbols
surya/settings.py · 10 symbols
surya/inference/backends/vllm.py · 10 symbols
surya/inference/backends/llamacpp.py · 10 symbols
surya/table_rec/__init__.py · 9 symbols

Dependencies from manifests, versioned

opencv-python-headless 4.11.0.86 · 1×
pypdfium2 4.30.0 · 1×
transformers 4.56.1 · 1×

For agents

$ claude mcp add surya \
  -- python -m otcore.mcp_server <graph>
