```python
class DoclingParser(Parser):
    """
    Docling document parsing utility class.

    Specialized in parsing Office documents and HTML files, converting the content
    into structured data and generating markdown and JSON output.

    Backed by the Docling Python API (`docling.document_converter.DocumentConverter`)
    to avoid subprocess overhead and re-initialization of Docling's deep-learning
    models on every call. A `DocumentConverter` instance is built lazily on first
    use and cached per pipeline-option combination so that subsequent parses
    against the same configuration reuse already-loaded models.

    Compatibility changes vs. earlier CLI-subprocess implementation
    ----------------------------------------------------------------
    - `check_installation()` now returns True iff the Docling Python package
      can be imported (`docling.document_converter.DocumentConverter`). The
      previous behavior of probing the `docling` CLI executable on PATH is
      gone; environments that ship the CLI without the importable package
      (or vice versa) will see a different result than before.
    - The legacy `env={...}` kwarg is still accepted for source-level
      compatibility but is **ignored**: the Python API does not run a
      subprocess, so per-call environment overrides no longer take effect.
      Callers needing model-cache, proxy, or CUDA configuration should set
      the corresponding environment variables in the parent process before
      instantiating `DoclingParser`, or configure Docling directly via
      `_get_converter` kwargs (`artifacts_path`, `table_mode`, ...).
    - JSON and Markdown artifacts are still written to
      `<output_dir>/<file_stem>/docling/` for backward compatibility, but
      they are produced by Docling's `export_to_dict()` /
      `export_to_markdown()` rather than by the CLI's serializer; expect the
      same logical content but not byte-identical files (key ordering,
      whitespace, optional fields may differ).

    Concurrency
    -----------
    The internal converter cache is guarded by a lock so that a single
    `DoclingParser` instance can be safely shared across threads without
    duplicating Docling model loads on first use.
    """

    # Define Docling-specific formats
    HTML_FORMATS = {".html", ".htm", ".xhtml"}

    def __init__(self) -> None:
        """Initialize DoclingParser"""
        super().__init__()
        # Cache of DocumentConverter instances keyed by pipeline-option tuple,
        # so that loaded layout/OCR/table models are reused across calls.
        # The lock guards concurrent first-use from creating duplicate
        # converters (and re-loading models) when the same DoclingParser
        # instance is shared across threads.
        self._converter_cache: Dict[Tuple, Any] = {}
        self._converter_cache_lock = threading.Lock()

    def parse_pdf(
        self,
        pdf_path: Union[str, Path],
```
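
The lazy, lock-guarded converter cache described in the docstring can be illustrated with a small standalone sketch. It is not the body of `_get_converter`, which is not shown here; `LazyConverterCache` and `fake_factory` are hypothetical names, and the factory stands in for whatever actually constructs a `DocumentConverter`. The option names `table_mode` and `do_ocr` are used only as example keys.

```python
import threading
from typing import Any, Callable, Dict, Hashable, List, Tuple


class LazyConverterCache:
    """Build one converter per option combination, on first use, under a lock."""

    def __init__(self, factory: Callable[..., Any]) -> None:
        # Hypothetical factory: stands in for the code that builds a converter.
        self._factory = factory
        self._cache: Dict[Tuple[Tuple[str, Hashable], ...], Any] = {}
        self._lock = threading.Lock()

    def get(self, **options: Hashable) -> Any:
        # Sort the items so that keyword order does not change the cache key.
        key = tuple(sorted(options.items()))
        # Holding the lock while building means two threads asking for the
        # same key cannot both run the (expensive) factory.
        with self._lock:
            if key not in self._cache:
                self._cache[key] = self._factory(**options)
            return self._cache[key]


builds: List[dict] = []  # records every factory call, for demonstration only


def fake_factory(**options: Hashable) -> dict:
    """Stand-in for an expensive converter constructor."""
    builds.append(dict(options))
    return dict(options)


cache = LazyConverterCache(fake_factory)
first = cache.get(table_mode="accurate", do_ocr=False)
second = cache.get(do_ocr=False, table_mode="accurate")  # same options, reordered
other = cache.get(table_mode="fast", do_ocr=False)       # different options
```

Here `first` and `second` are the same object and the factory runs twice in total, once per distinct option combination. Building inside the lock also serializes construction for unrelated keys; that is the simplest way to rule out duplicate loads, at the cost of some parallelism during first use.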
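
The import-based installation check mentioned in the compatibility notes could look roughly like the sketch below. How `check_installation()` is actually implemented is not shown in this excerpt, so the helper `is_importable` is a hypothetical name and the use of `importlib.util.find_spec` is an assumption; only the module path `docling.document_converter` comes from the docstring.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python's import system can locate `module_name`."""
    try:
        # For a dotted name, find_spec imports the parent package first and
        # then looks up the submodule without executing it.
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # A missing parent package raises ModuleNotFoundError, which is a
        # subclass of ImportError; ValueError covers a module whose
        # __spec__ is unset.
        return False


# True only where the Docling Python package is installed.
docling_available = is_importable("docling.document_converter")
```

This probe only locates the module. An installation that is present but fails at import time would still report `True`, so a stricter variant would run the real `import` inside a `try` block instead.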