hub / github.com/HKUDS/RAG-Anything / DoclingParser

Class DoclingParser

raganything/parser.py:1455–2049

Docling document parsing utility class. It specializes in parsing Office documents and HTML files, converting the content into structured data and generating Markdown and JSON output. Backed by the Docling Python API (`docling.document_converter.DocumentConverter`) to avoid subprocess overhead and re-initialization of Docling's deep-learning models on every call.
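
Because the class relies on the importable Docling package rather than a CLI executable, an availability check can be reduced to an import probe. The following is a minimal sketch of that idea, not the repository's `check_installation()` implementation; the helper name `can_import` is hypothetical, and it assumes that a plain `ImportError` is the only failure worth handling.

```python
import importlib


def can_import(module_name: str, attr: str) -> bool:
    """Return True if `module_name` imports and exposes `attr`.

    Hypothetical helper for illustration only; it is not part of
    RAG-Anything or Docling.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Package (or one of its dependencies) is not importable.
        return False
    return hasattr(module, attr)


if __name__ == "__main__":
    # Probe the module path named in the class description.
    available = can_import("docling.document_converter", "DocumentConverter")
    print(f"Docling Python API importable: {available}")
```

A probe like this reports on the Python package only, so a machine with the `docling` CLI on PATH but no importable package would yield `False`.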


class DoclingParser(Parser):
    """
    Docling document parsing utility class.

    Specialized in parsing Office documents and HTML files, converting the content
    into structured data and generating markdown and JSON output.

    Backed by the Docling Python API (`docling.document_converter.DocumentConverter`)
    to avoid subprocess overhead and re-initialization of Docling's deep-learning
    models on every call. A `DocumentConverter` instance is built lazily on first
    use and cached per pipeline-option combination so that subsequent parses
    against the same configuration reuse already-loaded models.

    Compatibility changes vs. earlier CLI-subprocess implementation
    ----------------------------------------------------------------
    - `check_installation()` now returns True iff the Docling Python package
      can be imported (`docling.document_converter.DocumentConverter`). The
      previous behavior of probing the `docling` CLI executable on PATH is
      gone; environments that ship the CLI without the importable package
      (or vice versa) will see a different result than before.
    - The legacy `env={...}` kwarg is still accepted for source-level
      compatibility but is **ignored**: the Python API does not run a
      subprocess, so per-call environment overrides no longer take effect.
      Callers needing model-cache, proxy, or CUDA configuration should set
      the corresponding environment variables in the parent process before
      instantiating `DoclingParser`, or configure Docling directly via
      `_get_converter` kwargs (`artifacts_path`, `table_mode`, ...).
    - JSON and Markdown artifacts are still written to
      `<output_dir>/<file_stem>/docling/` for backward compatibility, but
      they are produced by Docling's `export_to_dict()` /
      `export_to_markdown()` rather than by the CLI's serializer; expect the
      same logical content but not byte-identical files (key ordering,
      whitespace, optional fields may differ).

    Concurrency
    -----------
    The internal converter cache is guarded by a lock so that a single
    `DoclingParser` instance can be safely shared across threads without
    duplicating Docling model loads on first use.
    """

    # Define Docling-specific formats
    HTML_FORMATS = {".html", ".htm", ".xhtml"}

    def __init__(self) -> None:
        """Initialize DoclingParser"""
        super().__init__()
        # Cache of DocumentConverter instances keyed by pipeline-option tuple,
        # so that loaded layout/OCR/table models are reused across calls.
        # The lock guards concurrent first-use from creating duplicate
        # converters (and re-loading models) when the same DoclingParser
        # instance is shared across threads.
        self._converter_cache: Dict[Tuple, Any] = {}
        self._converter_cache_lock = threading.Lock()

    def parse_pdf(
        self,
        pdf_path: Union[str, Path],
Callers 2

docling_parser · Function · 0.90
get_parser · Function · 0.85

Calls

no outgoing calls

Tested by 1

docling_parser · Function · 0.72
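
The class docstring states that JSON and Markdown artifacts are written to `<output_dir>/<file_stem>/docling/`. The sketch below shows one way that directory could be computed with `pathlib`. The helper name `docling_output_dir` is hypothetical, and it assumes `<file_stem>` means the file name without its final suffix (the behavior of `Path.stem`); the excerpt does not show how the parser actually derives it.

```python
from pathlib import Path
from typing import Union


def docling_output_dir(
    input_path: Union[str, Path], output_dir: Union[str, Path]
) -> Path:
    """Return <output_dir>/<file_stem>/docling for an input file.

    Hypothetical helper for illustration; not part of RAG-Anything.
    """
    return Path(output_dir) / Path(input_path).stem / "docling"


if __name__ == "__main__":
    target = docling_output_dir("docs/report.docx", "out")
    # Create the directory before writing any artifacts into it.
    # target.mkdir(parents=True, exist_ok=True)
    print(target)
```

The names of the individual artifact files inside that directory are not given in the excerpt, so the sketch stops at the directory level.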