hub / github.com/langroid/langroid / test_chunking_sizes

Function test_chunking_sizes

tests/main/test_md_parser.py:314–381 · view source on GitHub ↗

Test that the chunking logic produces chunks that: - Have token counts between the lower and upper bounds (except possibly the final chunk) - Include the header enrichment in each chunk's text - Include the expected overlap between consecutive chunks

(
    chunk_size: int,
    rollup: bool,
)

Source from the content-addressed store, hash-verified

312	@pytest.mark.parametrize("rollup", [False, True])
313	@pytest.mark.parametrize("chunk_size", [20, 500])
314	def test_chunking_sizes(
315	chunk_size: int,
316	rollup: bool,
317	):
318	"""
319	Test that the chunking logic produces chunks that:
320	- Have token counts between the lower and upper bounds
321	(except possibly the final chunk)
322	- Include the header enrichment in each chunk's text
323	- Include the expected overlap between consecutive chunks
324	"""
325	# Create a long text consisting of 200 repeated tokens ("word")
326	long_text = " ".join(["word"] * 200) # 200 tokens
327	md_text = f"""# Chapter 1
328	{long_text}
329	"""
330
331	# Set chunking configuration.
332	# Here chunk_size=50 means that (with variation_percent=0.2)
333	# we expect chunks to have between 40 and 60 tokens.
334	config = MarkdownChunkConfig(
335	chunk_size=chunk_size, rollup=rollup, overlap_tokens=5, variation_percent=0.2
336	)
337
338	# Produce the enriched chunks from the tree.
339	chunks = chunk_markdown(md_text, config)
340
341	# Compute the allowed bounds.
342	lower_bound = config.chunk_size * (1 - config.variation_percent)
343	upper_bound = config.chunk_size * (1 + config.variation_percent)
344
345	# Verify each chunk's token count.
346	# For all chunks except possibly the final one,
347	# we expect at least lower_bound tokens.
348	for i, chunk in enumerate(chunks):
349	tokens = count_words(chunk)
350	if i < len(chunks) - 1:
351	assert (
352	tokens >= lower_bound
353	), f"Chunk {i} has {tokens} tokens, expected at least {lower_bound}"
354	assert (
355	tokens <= 2 * upper_bound
356	), ( # relaxed check
357	f"Chunk {i} has {tokens} tokens, expected at most {upper_bound}"
358	)
359
360	# Check that each chunk is enriched with the header context.
361	# Each chunk's text should contain "Chapter 1" since that is the header path.
362	for i, chunk in enumerate(chunks):
363	assert "Chapter 1" in chunk, f"Chunk {i} is missing header enrichment"
364
365	# Verify that consecutive chunks share the expected overlap.
366	# For each consecutive pair of chunks, the last `overlap_tokens`
367	# tokens of the previous chunk
368	# should appear among the first tokens of the next chunk.
369	if len(chunks) > 1:
370	for i in range(len(chunks) - 1):
371	prev_tokens = chunks[i].split()

Callers

nothing calls this directly

Calls 4

MarkdownChunkConfigClass · 0.90

chunk_markdownFunction · 0.90

count_wordsFunction · 0.90

splitMethod · 0.45

Tested by

no test coverage detected

Used in the wild real call sites across dependent graphs

searching dependent graphs…