github.com/HKUDS/RAG-Anything

Class DoclingParser

raganything/parser.py:1455–2049

Docling document parsing utility class. Specializes in parsing Office documents and HTML files, converting the content into structured data and generating Markdown and JSON output. Backed by the Docling Python API (`docling.document_converter.DocumentConverter`) to avoid subprocess overhead and re-initialization of Docling's deep-learning models on every call.
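The class docstring states that `check_installation()` returns True iff `docling.document_converter.DocumentConverter` can be imported, rather than probing for a `docling` executable on PATH. A minimal sketch of such an import-based check follows. The helper name `can_import` is hypothetical and this is not the actual body of `check_installation()`.

```python
import importlib


def can_import(module_name: str, attribute: str) -> bool:
    """Return True if `module_name` imports cleanly and exposes `attribute`.

    Hypothetical helper for illustration only; not part of RAG-Anything.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Covers ModuleNotFoundError as well, which subclasses ImportError.
        return False
    return hasattr(module, attribute)


if __name__ == "__main__":
    # Mirrors the condition described in the docstring: the Python package
    # must be importable; a CLI on PATH alone would not satisfy this check.
    print(can_import("docling.document_converter", "DocumentConverter"))
```

A check of this shape reports on the Python environment only, which is why the docstring warns that installations shipping the CLI without the importable package (or the reverse) see a different result than before.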


class DoclingParser(Parser):
    """
    Docling document parsing utility class.

    Specialized in parsing Office documents and HTML files, converting the content
    into structured data and generating markdown and JSON output.

    Backed by the Docling Python API (`docling.document_converter.DocumentConverter`)
    to avoid subprocess overhead and re-initialization of Docling's deep-learning
    models on every call. A `DocumentConverter` instance is built lazily on first
    use and cached per pipeline-option combination so that subsequent parses
    against the same configuration reuse already-loaded models.

    Compatibility changes vs. earlier CLI-subprocess implementation
    ----------------------------------------------------------------
    - `check_installation()` now returns True iff the Docling Python package
      can be imported (`docling.document_converter.DocumentConverter`). The
      previous behavior of probing the `docling` CLI executable on PATH is
      gone; environments that ship the CLI without the importable package
      (or vice versa) will see a different result than before.
    - The legacy `env={...}` kwarg is still accepted for source-level
      compatibility but is ignored: the Python API does not run a
      subprocess, so per-call environment overrides no longer take effect.
      Callers needing model-cache, proxy, or CUDA configuration should set
      the corresponding environment variables in the parent process before
      instantiating `DoclingParser`, or configure Docling directly via
      `_get_converter` kwargs (`artifacts_path`, `table_mode`, ...).
    - JSON and Markdown artifacts are still written to
      `<output_dir>/<file_stem>/docling/` for backward compatibility, but
      they are produced by Docling's `export_to_dict()` /
      `export_to_markdown()` rather than by the CLI's serializer; expect the
      same logical content but not byte-identical files (key ordering,
      whitespace, optional fields may differ).

    Concurrency
    -----------
    The internal converter cache is guarded by a lock so that a single
    `DoclingParser` instance can be safely shared across threads without
    duplicating Docling model loads on first use.
    """

    # Define Docling-specific formats
    HTML_FORMATS = {".html", ".htm", ".xhtml"}

    def __init__(self) -> None:
        """Initialize DoclingParser"""
        super().__init__()
        # Cache of DocumentConverter instances keyed by pipeline-option tuple,
        # so that loaded layout/OCR/table models are reused across calls.
        # The lock guards concurrent first-use from creating duplicate
        # converters (and re-loading models) when the same DoclingParser
        # instance is shared across threads.
        self._converter_cache: Dict[Tuple, Any] = {}
        self._converter_cache_lock = threading.Lock()

    def parse_pdf(
        self,
        pdf_path: Union[str, Path],
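The excerpt ends before `_get_converter`, so the actual cache lookup is not shown here. As a rough illustration of the pattern that the docstring and `__init__` describe (a lazily built object per option tuple, guarded by a lock), a standalone sketch might look like this. `LazyConverterCache`, `fake_factory`, and the `do_ocr` option are hypothetical names; only `table_mode` appears in the source, and the real method may build its key differently.

```python
import threading
from typing import Any, Callable, Dict, Tuple


class LazyConverterCache:
    """Build one object per option combination and reuse it across threads.

    Illustrative stand-in, not the RAG-Anything implementation.
    """

    def __init__(self, factory: Callable[..., Any]) -> None:
        self._factory = factory
        self._cache: Dict[Tuple, Any] = {}
        self._lock = threading.Lock()

    def get(self, **options: Any) -> Any:
        # Sort by option name so keyword order does not change the key.
        # Option values must be hashable to serve as part of a dict key.
        key = tuple(sorted(options.items()))
        with self._lock:
            if key not in self._cache:
                # Built at most once per key, even under concurrent first use.
                self._cache[key] = self._factory(**options)
            return self._cache[key]


def demo() -> int:
    """Request the same configuration from 8 threads; return the build count."""
    calls = []

    def fake_factory(**options: Any) -> object:
        calls.append(options)  # stands in for an expensive model load
        return object()

    cache = LazyConverterCache(fake_factory)
    threads = [
        threading.Thread(
            target=cache.get, kwargs={"do_ocr": True, "table_mode": "fast"}
        )
        for _ in range(8)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return len(calls)


if __name__ == "__main__":
    print(demo())
```

Holding the lock while the factory runs keeps the sketch simple and guarantees a single build per key, at the cost of making threads that request a different configuration wait as well. Whether `DoclingParser` makes the same trade-off cannot be determined from this excerpt.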

Callers 2

docling_parser · Function · 0.90

get_parser · Function · 0.85

Calls

no outgoing calls

Tested by 1

docling_parser · Function · 0.72
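When testing against the artifacts described in the class docstring, note that files under `<output_dir>/<file_stem>/docling/` are expected to match earlier output logically but not byte for byte. The sketch below shows one way to build that directory and compare JSON by parsed content. The function names and the `<file_stem>.json` / `<file_stem>.md` file names are assumptions; the source specifies only the directory layout.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Tuple


def write_artifacts(
    output_dir: Path, source: Path, content: Dict[str, Any], markdown: str
) -> Tuple[Path, Path]:
    """Write JSON and Markdown under <output_dir>/<file_stem>/docling/."""
    target = output_dir / source.stem / "docling"
    target.mkdir(parents=True, exist_ok=True)
    # File names below are assumed for illustration.
    json_path = target / f"{source.stem}.json"
    md_path = target / f"{source.stem}.md"
    json_path.write_text(
        json.dumps(content, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    md_path.write_text(markdown, encoding="utf-8")
    return json_path, md_path


def same_logical_json(first: Path, second: Path) -> bool:
    """Compare parsed JSON so key order and whitespace are ignored."""
    left = json.loads(first.read_text(encoding="utf-8"))
    right = json.loads(second.read_text(encoding="utf-8"))
    return left == right


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        old_json, _ = write_artifacts(
            root / "old", Path("report.docx"), {"x": 1, "y": 2}, "# Title\n"
        )
        new_json, _ = write_artifacts(
            root / "new", Path("report.docx"), {"y": 2, "x": 1}, "# Title\n"
        )
        print(old_json.read_bytes() == new_json.read_bytes())
        print(same_logical_json(old_json, new_json))
```

This comparison absorbs differences in key ordering and whitespace only. Optional fields that appear in one file and not the other, which the docstring also mentions, would still need an explicit allowance in a real test.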