hub / github.com/langroid/langroid / test_pypdfium2_parser

Function test_pypdfium2_parser

tests/main/test_pdf_parser.py:121–155

Dedicated functional test for the default `pypdfium2` PDF parser. `pypdfium2` is installed by default (core dependency), so this test does NOT require any optional `extras` to be installed -- it exercises the out-of-the-box PDF-parsing path that a bare `pip install langroid` gets.

(source: str)

Source from the content-addressed store, hash-verified

# Module-level imports the excerpt relies on. They are not part of the
# excerpt itself; the import paths below are assumed.
import pytest

from langroid.parsing.document_parser import DocumentParser
from langroid.parsing.parser import ParsingConfig, PdfParsingConfig


@pytest.mark.parametrize("source", ["path", "bytes"])
def test_pypdfium2_parser(source: str):
    """
    Dedicated functional test for the default `pypdfium2` PDF parser.

    `pypdfium2` is installed by default (core dependency), so this test does
    NOT require any optional `extras` to be installed -- it exercises the
    out-of-the-box PDF-parsing path that a bare `pip install langroid` gets.
    """
    from langroid.parsing.document_parser import PyPDFium2Parser

    path = "tests/main/data/dummy.pdf"
    parser = DocumentParser.create(
        path, ParsingConfig(pdf=PdfParsingConfig(library="pypdfium2"))
    )
    assert isinstance(parser, PyPDFium2Parser)

    if source == "bytes":
        # Re-create the parser from raw bytes, reusing the same config.
        with open(path, "rb") as f:
            data = f.read()
        parser = DocumentParser.create(data, parser.config)
        assert isinstance(parser, PyPDFium2Parser)

    # Expected `metadata.source`: the file path, or the literal "bytes".
    citation = path if source == "path" else "bytes"

    doc = parser.get_doc()
    assert isinstance(doc.content, str)
    # content correctness: known text from the sample PDF
    assert "Design and Evaluation" in doc.content
    assert "arXiv:2004.07606v1" in doc.content
    assert doc.metadata.source == citation

    chunks = parser.get_doc_chunks()
    assert len(chunks) > 0
    assert all(c.metadata.is_chunk for c in chunks)
    assert all(citation in c.metadata.source for c in chunks)
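The test runs once per input kind (a file path, then raw bytes) and expects the recorded source to be the path itself or the literal string "bytes". A minimal standard-library sketch of that dispatch follows. The names `describe_source` and `read_source` are hypothetical helpers introduced for illustration; this is not langroid's implementation of `DocumentParser.create`.

```python
from pathlib import Path
from typing import Union


def describe_source(source: Union[str, bytes]) -> str:
    """Return the label to record as a document's source.

    Hypothetical helper: a path is cited as-is, while raw bytes are cited
    as the literal string "bytes", mirroring the `citation` value that the
    test computes.
    """
    if isinstance(source, bytes):
        return "bytes"
    return str(source)


def read_source(source: Union[str, bytes]) -> bytes:
    """Return the raw document bytes for either kind of input."""
    if isinstance(source, bytes):
        # Already in memory: nothing to read from disk.
        return source
    return Path(source).read_bytes()
```

Keeping the label logic separate from the reading logic is one way to let both input kinds share the same downstream assertions, as the parametrized test does.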

Callers

nothing calls this directly

Calls 5

ParsingConfig · Class · 0.90
PdfParsingConfig · Class · 0.90
get_doc · Method · 0.80
get_doc_chunks · Method · 0.80
create · Method · 0.45
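Outside a test, the calls listed above might be combined as in the sketch below. `extract_pdf` is a hypothetical wrapper. The import path for `ParsingConfig` and `PdfParsingConfig`, and the assumption that each chunk exposes a `content` attribute like the full document does, are not confirmed by this page.

```python
from typing import List, Tuple


def extract_pdf(path: str) -> Tuple[str, List[str]]:
    """Return the full text and the chunk texts of a PDF via pypdfium2.

    Hypothetical wrapper around the calls exercised by the test.
    """
    # Imported lazily so this module can be loaded without langroid.
    from langroid.parsing.document_parser import DocumentParser

    # Assumed import path for the two config classes.
    from langroid.parsing.parser import ParsingConfig, PdfParsingConfig

    config = ParsingConfig(pdf=PdfParsingConfig(library="pypdfium2"))
    parser = DocumentParser.create(path, config)

    doc = parser.get_doc()  # whole document as a single object
    chunks = parser.get_doc_chunks()  # same document, split into chunks

    # Assumption: chunks carry their text in `content`, as `doc` does.
    return doc.content, [chunk.content for chunk in chunks]
```

Calling `extract_pdf("tests/main/data/dummy.pdf")` from the repository root would exercise the same path as the `"path"` variant of the test, provided langroid is installed.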

Tested by

no test coverage detected
