
github.com/opendataloader-project/opendataloader-pdf @v2.4.7 sqlite

2,293 symbols · 8,648 edges · 219 files · 804 documented (35%)
README

OpenDataLoader PDF

PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.

🔍 PDF parser for AI data extraction — Extract Markdown, JSON (with bounding boxes), and HTML from any PDF. #1 in benchmarks (0.907 overall). Deterministic local mode + AI hybrid mode for complex pages.

  • How accurate is it? — #1 in benchmarks: 0.907 overall, 0.928 table accuracy across 200 real-world PDFs including multi-column and scientific papers. Deterministic local mode + AI hybrid mode for complex pages (benchmarks)
  • Scanned PDFs and OCR? — Yes. Built-in OCR (80+ languages) in hybrid mode. Works with poor-quality scans at 300 DPI+ (hybrid mode)
  • Tables, formulas, images, charts? — Yes. Complex/borderless tables, LaTeX formulas, and AI-generated picture/chart descriptions all via hybrid mode (hybrid mode)
  • How do I use this for RAG? — Run pip install opendataloader-pdf, then convert in 3 lines. Outputs structured Markdown for chunking, JSON with bounding boxes for source citations, and HTML. LangChain integration available. Python, Node.js, and Java SDKs (quick start | LangChain)

PDF accessibility automation — Auto-tag untagged PDFs into screen-reader-ready Tagged PDFs at scale. First open-source tool to generate Tagged PDFs end-to-end.

  • What's the problem? — Accessibility regulations are now enforced worldwide. Manual PDF remediation costs $50–200 per document and doesn't scale (regulations)
  • What's free? — Layout analysis + auto-tagging (Apache 2.0). Untagged PDF in → Tagged PDF out. No proprietary SDK dependency (auto-tagging)
  • What about PDF/UA compliance? — Converting Tagged PDF to PDF/UA-1 or PDF/UA-2 is an enterprise add-on. Auto-tagging generates the Tagged PDF; PDF/UA export is the final step (pipeline)
  • Why trust this? — Built in collaboration with Dual Lab (veraPDF developers) based on PDF Association specifications, best practice guides and expertise of the PDF Community. Auto-tagging follows the Well-Tagged PDF specification, validated with veraPDF (collaboration)

Get Started in 30 Seconds

Requires: Java 11+ and Python 3.10+ (Node.js | Java also available)

Before you start: run java -version. If not found, install JDK 11+ from Adoptium.

pip install -U opendataloader-pdf

import opendataloader_pdf

# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    format="markdown,json"
)

Annotated PDF output — each element (heading, paragraph, table, image) detected with bounding boxes and semantic type.

What Problems Does This Solve?

| Problem | Solution | Status |
|---|---|---|
| PDF structure lost during parsing — wrong reading order, broken tables, no element coordinates | Deterministic local PDF to Markdown/JSON with bounding boxes, XY-Cut++ reading order | Shipped |
| Complex tables, scanned PDFs, formulas, charts need AI-level understanding | Hybrid mode routes complex pages to AI backend (#1 in benchmarks) | Shipped |
| Manual PDF remediation cost — Accessibility regulations (EAA, ADA, Section 508) demand Tagged PDFs. Manual remediation costs $50–200/doc | Auto-tag untagged PDFs into Tagged PDFs (free, Apache 2.0). Foundation for PDF/UA workflows; full PDF/UA-1/2 export is an enterprise add-on | Auto-tag: Shipped. PDF/UA export: Enterprise |

Capability Matrix

| Capability | Supported | Tier |
|---|---|---|
| **Data extraction** | | |
| Extract text with correct reading order | Yes | Free |
| Bounding boxes for every element | Yes | Free |
| Table extraction (simple borders) | Yes | Free |
| Table extraction (complex/borderless) | Yes | Free (Hybrid) |
| Heading hierarchy detection | Yes | Free |
| List detection (numbered, bulleted, nested) | Yes | Free |
| Image extraction with coordinates | Yes | Free |
| AI chart/image description | Yes | Free (Hybrid) |
| OCR for scanned PDFs | Yes | Free (Hybrid) |
| Formula extraction (LaTeX) | Yes | Free (Hybrid) |
| Tagged PDF structure extraction | Yes | Free |
| AI safety (prompt injection filtering) | Yes | Free |
| Header/footer/watermark filtering | Yes | Free |
| **Accessibility** | | |
| Auto-tagging → Tagged PDF for untagged PDFs | Yes | Free (Apache 2.0) |
| PDF/UA-1, PDF/UA-2 export | 💼 Available | Enterprise |
| Accessibility studio (visual editor) | 💼 Available | Enterprise |
| **Limitations** | | |
| Process Word/Excel/PPT | No | |
| GPU required | No | |

Extraction Benchmarks

opendataloader-pdf [hybrid] ranks #1 overall (0.907) across reading order, table, and heading extraction accuracy.

| Engine | Overall | Reading Order | Table | Heading | Speed (s/page) | License |
|---|---|---|---|---|---|---|
| opendataloader [hybrid] | **0.907** | **0.934** | **0.928** | 0.821 | 0.463 | Apache-2.0 |
| nutrient | 0.885 | 0.925 | 0.708 | 0.819 | **0.008** | Commercial |
| docling | 0.882 | 0.898 | 0.887 | **0.824** | 0.762 | MIT |
| marker | 0.861 | 0.890 | 0.808 | 0.796 | 53.932 | GPL-3.0 |
| unstructured [hi_res] | 0.841 | 0.904 | 0.588 | 0.749 | 3.008 | Apache-2.0 |
| edgeparse | 0.837 | 0.894 | 0.717 | 0.706 | 0.036 | Apache-2.0 |
| opendataloader | 0.831 | 0.902 | 0.489 | 0.739 | 0.015 | Apache-2.0 |
| mineru | 0.831 | 0.857 | 0.873 | 0.743 | 5.962 | AGPL-3.0 |
| pymupdf4llm | 0.732 | 0.885 | 0.401 | 0.412 | 0.091 | AGPL-3.0 |
| unstructured | 0.686 | 0.882 | 0.000 | 0.388 | 0.077 | Apache-2.0 |
| markitdown | 0.589 | 0.844 | 0.273 | 0.000 | 0.114 | MIT |
| liteparse | 0.576 | 0.866 | 0.000 | 0.000 | 1.061 | Apache-2.0 |

Accuracy scores are normalized to [0, 1], where higher is better. Speed is in seconds per page, where lower is better. Bold = best in each column. Full benchmark details
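As a worked check on the table above, the short sketch below computes what hybrid mode gains over the default local mode and what it costs in time per page. The figures are copied from the two opendataloader rows; the helper name `relative_gain` is illustrative and not part of the library.

```python
# Figures copied from the two opendataloader rows of the benchmark table.
LOCAL = {"overall": 0.831, "table": 0.489, "speed_s_per_page": 0.015}
HYBRID = {"overall": 0.907, "table": 0.928, "speed_s_per_page": 0.463}


def relative_gain(before: float, after: float) -> float:
    """Return the relative change from `before` to `after` as a fraction."""
    return (after - before) / before


table_gain = relative_gain(LOCAL["table"], HYBRID["table"])
overall_gain = relative_gain(LOCAL["overall"], HYBRID["overall"])
slowdown = HYBRID["speed_s_per_page"] / LOCAL["speed_s_per_page"]

print(f"table accuracy gain:   {table_gain:.1%}")
print(f"overall accuracy gain: {overall_gain:.1%}")
print(f"time per page ratio:   {slowdown:.1f}x")
```

On these figures the table score rises by roughly 90% relative to local mode, while each page takes about 31 times longer, which is why the default mode remains the sensible choice for documents without complex tables.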

Which Mode Should I Use?

| Your Document | Mode | Install | Server Command | Client Command |
|---|---|---|---|---|
| Standard digital PDF | Fast (default) | `pip install opendataloader-pdf` | None needed | `opendataloader-pdf file1.pdf file2.pdf folder/` |
| Complex or nested tables | Hybrid | `pip install "opendataloader-pdf[hybrid]"` | `opendataloader-pdf-hybrid --port 5002` | `opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/` |
| Scanned / image-based PDF | Hybrid + OCR | `pip install "opendataloader-pdf[hybrid]"` | `opendataloader-pdf-hybrid --port 5002 --force-ocr` | `opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/` |
| Non-English scanned PDF | Hybrid + OCR | `pip install "opendataloader-pdf[hybrid]"` | `opendataloader-pdf-hybrid --port 5002 --force-ocr --ocr-lang "ko,en"` | `opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/` |
| Mathematical formulas | Hybrid + formula | `pip install "opendataloader-pdf[hybrid]"` | `opendataloader-pdf-hybrid --enrich-formula` | `opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/` |
| Charts needing description | Hybrid + picture | `pip install "opendataloader-pdf[hybrid]"` | `opendataloader-pdf-hybrid --enrich-picture-description` | `opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/` |
| Untagged PDFs needing accessibility | Auto-tagging → Tagged PDF | `pip install opendataloader-pdf` | None needed | `opendataloader-pdf --format tagged-pdf file1.pdf file2.pdf folder/` |
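For scripted pipelines, the table above can be encoded as a small lookup. The sketch below only assembles argument lists from the flags listed in the table; the dictionary keys (`"standard"`, `"scanned"`, and so on) and the function `build_commands` are illustrative names, not part of the package.

```python
import shlex

# (server flags, client flags) per document type, copied from the table above.
# A server entry of None means no backend server is needed.
MODES = {
    "standard": (None, []),
    "complex_tables": (["--port", "5002"], ["--hybrid", "docling-fast"]),
    "scanned": (["--port", "5002", "--force-ocr"], ["--hybrid", "docling-fast"]),
    "scanned_ko_en": (
        ["--port", "5002", "--force-ocr", "--ocr-lang", "ko,en"],
        ["--hybrid", "docling-fast"],
    ),
    "formulas": (
        ["--enrich-formula"],
        ["--hybrid", "docling-fast", "--hybrid-mode", "full"],
    ),
    "charts": (
        ["--enrich-picture-description"],
        ["--hybrid", "docling-fast", "--hybrid-mode", "full"],
    ),
    "auto_tag": (None, ["--format", "tagged-pdf"]),
}


def build_commands(kind, inputs):
    """Return (server_cmd, client_cmd) as argument lists for one document type."""
    server_flags, client_flags = MODES[kind]
    server_cmd = None
    if server_flags is not None:
        server_cmd = ["opendataloader-pdf-hybrid", *server_flags]
    client_cmd = ["opendataloader-pdf", *client_flags, *inputs]
    return server_cmd, client_cmd


server, client = build_commands("formulas", ["file1.pdf", "folder/"])
print(shlex.join(server))
print(shlex.join(client))
```

Returning argument lists rather than strings keeps the result safe to hand to `subprocess.run` without shell quoting concerns; `shlex.join` is used here only for display.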

Quick Start

Python

pip install -U opendataloader-pdf

import opendataloader_pdf

# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    format="markdown,json"
)
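Because each convert() call starts its own JVM, a common pattern is to discover all PDFs first and pass them in a single call. The following is a minimal sketch of that idea; `find_pdfs` and `convert_tree` are hypothetical helper names, and the only library call used is the convert() signature shown above.

```python
from pathlib import Path


def find_pdfs(root):
    """Return every file ending in .pdf under `root` (recursively), sorted."""
    # Note: the pattern is lowercase, so "*.PDF" files are skipped on
    # case-sensitive filesystems.
    return sorted(str(p) for p in Path(root).rglob("*.pdf") if p.is_file())


def convert_tree(root, output_dir="output/"):
    """Convert all PDFs under `root` with one convert() call (one JVM start)."""
    # Imported lazily so find_pdfs() can be used without the package installed.
    import opendataloader_pdf

    pdfs = find_pdfs(root)
    if pdfs:  # avoid starting a JVM when there is nothing to convert
        opendataloader_pdf.convert(
            input_path=pdfs,
            output_dir=output_dir,
            format="markdown,json",
        )
    return pdfs
```

Calling `convert_tree("reports/")` would then process the whole tree in one invocation instead of one JVM start per file.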

Node.js

npm install @opendataloader/pdf

import { convert } from '@opendataloader/pdf';

await convert(['file1.pdf', 'file2.pdf', 'folder/'], {
  outputDir: 'output/',
  format: 'markdown,json'
});

Java

<dependency>
  <groupId>org.opendataloader</groupId>
  <artifactId>opendataloader-pdf-core</artifactId>
</dependency>

Python Quick Start | Node.js Quick Start | Java Quick Start

Hybrid Mode: #1 Accuracy for Complex PDFs

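Hybrid mode combines fast local Java processing with AI backends. Simple pages stay local (about 0.02 s each); complex pages are routed to an AI backend, which lifts table accuracy from 0.489 to 0.928 in the benchmarks above, roughly a 90% relative gain.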

pip install -U "opendataloader-pdf[hybrid]"

Terminal 1 — Start the backend server:

opendataloader-pdf-hybrid --port 5002

Terminal 2 — Process PDFs:

# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/

Python:

# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    hybrid="docling-fast"
)
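The two-terminal workflow above can also be driven from one script. The sketch below starts the backend, waits until its port accepts connections, runs a single convert() call, and stops the backend again. It assumes the server listens on 127.0.0.1 and that the client connects to port 5002 by default, as the commands above suggest; `wait_for_port` and `run_hybrid` are hypothetical helper names.

```python
import socket
import subprocess
import time

HYBRID_PORT = 5002  # port used in the server command above


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Poll until a TCP connection to (host, port) succeeds or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


def run_hybrid(inputs, output_dir="output/"):
    """Start the hybrid backend, convert `inputs` once it is reachable, then stop it."""
    import opendataloader_pdf  # requires the [hybrid] extra to be installed

    server = subprocess.Popen(
        ["opendataloader-pdf-hybrid", "--port", str(HYBRID_PORT)]
    )
    try:
        # Assumption: the backend binds to localhost.
        if not wait_for_port("127.0.0.1", HYBRID_PORT):
            raise RuntimeError(f"hybrid backend did not open port {HYBRID_PORT}")
        opendataloader_pdf.convert(
            input_path=inputs,
            output_dir=output_dir,
            hybrid="docling-fast",
        )
    finally:
        server.terminate()
        server.wait()
```

An open port only shows that the process is accepting connections; if the backend loads models lazily, the first request may still be slow.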

OCR for Scanned PDFs

Start the backend with --force-ocr for image-based PDFs with no selectable text:

opendataloader-pdf-hybrid --port 5002 --force-ocr

For non-English documents, specify the language:

opendataloader-pdf-hybrid --port 5002 --force-ocr --ocr-lang "ko,en"

Supported languages: en, ko, ja, ch_sim, ch_tra, de, fr, ar, and more.

Formula Extraction (LaTeX)

Extract mathematical formulas as LaTeX from scientific PDFs:

# Server: enable formula enrichment
opendataloader-pdf-hybrid --enrich-formula

# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/

Output in JSON:

{
  "type": "formula",
  "page number": 1,
  "bounding box": [226.2, 144.7, 377.1, 168.7],
  "content": "\\frac{f(x+h) - f(x)}{h}"
}

Note: Formula and picture description enrichments require --hybrid-mode full on the client side.
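To pull formulas back out of the JSON output, one option is to walk the parsed tree and keep elements whose "type" is "formula". The sketch below relies only on the keys shown in the sample element above ("type", "page number", "bounding box", "content"); the surrounding document structure is not shown here, so the walker makes no assumption about nesting, and the `"elements"` wrapper in the sample is hypothetical.

```python
import json


def iter_elements(node):
    """Yield every dict in a parsed JSON tree that has "type" and "bounding box"."""
    if isinstance(node, dict):
        if "type" in node and "bounding box" in node:
            yield node
        for value in node.values():
            yield from iter_elements(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_elements(item)


def formulas(document):
    """Return (page number, bounding box, LaTeX source) for each formula element."""
    return [
        (el.get("page number"), el["bounding box"], el.get("content", ""))
        for el in iter_elements(document)
        if el["type"] == "formula"
    ]


# Hypothetical wrapper around the element shown above; real files may nest differently.
sample = json.loads(r'''
{"elements": [
  {"type": "formula", "page number": 1,
   "bounding box": [226.2, 144.7, 377.1, 168.7],
   "content": "\\frac{f(x+h) - f(x)}{h}"}
]}
''')

for page, box, latex in formulas(sample):
    print(f"page {page} at {box}: $${latex}$$")
```

The same walker can collect any other element type by changing the filter, and the page number and bounding box give what is needed for a source citation.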

Chart & Image Description

Generate AI descriptions for charts and images — useful for RAG search and accessibility alt text:

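```bash
# Server
opendataloader-pdf-hybrid --enrich-picture-description

# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/
```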

Extension points

Exported contracts: how you extend this code.

HybridSchemaTransformer (Interface)
Interface for transforming hybrid backend responses to IObject hierarchy. Implementations of this interface convert [4 …
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/hybrid/HybridSchemaTransformer.java
Capture (Interface)
(no doc)
node/opendataloader-pdf/test/streaming.integration.test.ts
PageImageCache (Interface)
Cache for page images (pdf2img results). Implementations control where images are stored and when they are evicted. [4 …
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/hybrid/PageImageCache.java
JarExecutionOptions (Interface)
(no doc)
node/opendataloader-pdf/src/index.ts
HybridClient (Interface)
Interface for hybrid PDF processing backends. Hybrid processing routes pages to external AI backends (like docling, [4 …
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/hybrid/HybridClient.java
RunOptions (Interface)
(no doc)
node/opendataloader-pdf/src/index.ts

Core symbols

Most depended-on inside this repo.

isEmpty
called by 231
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/TableStructureNormalizer.java
toString
called by 179
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/hybrid/TriageProcessor.java
getValue
called by 146
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/pdf/PDFLayer.java
equals
called by 92
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/hybrid/HybridClient.java
setOutputFolder
called by 67
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java
setGenerateJSON
called by 65
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java
write
called by 64
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/html/HtmlGenerator.java
processFile
called by 62
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java

Shape

Method 1,906
Class 205
Function 162
Interface 10
Enum 5
Route 5

Languages

Java 89%
Python 9%
TypeScript 2%

Modules by API surface

java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/api/Config.java: 80 symbols
java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/api/cli/CLIOptionsTest.java: 71 symbols
java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/hybrid/HancomAISchemaTransformerTest.java: 64 symbols
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/hybrid/TriageProcessor.java: 64 symbols
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/AutoTaggingProcessor.java: 55 symbols
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/hybrid/ElementMetadata.java: 46 symbols
java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/HybridDocumentProcessorTest.java: 41 symbols
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/HybridDocumentProcessor.java: 41 symbols
python/opendataloader-pdf/tests/test_hybrid_server_ocr_options.py: 40 symbols
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/hybrid/HancomAISchemaTransformer.java: 40 symbols
java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/readingorder/XYCutPlusPlusSorterTest.java: 38 symbols
java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/PageSeparatorIntegrationTest.java: 36 symbols

Dependencies

From manifests, versioned where a version is declared.

com.squareup.okhttp3:mockwebserver 4.12.0 · 1×
commons-cli:commons-cli
org.assertj:assertj-core
org.junit.jupiter:junit-jupiter
org.opendataloader:opendataloader-pdf-core
org.verapdf:validation-model
org.verapdf:wcag-algorithms
org.verapdf:wcag-validation
@eslint/js 10.0.1 · 1×
@types/node 22.0.0 · 1×

For agents

$ claude mcp add opendataloader-pdf \
  -- python -m otcore.mcp_server <graph>
