hub / github.com/idank/explainshell / chunk_text

Function chunk_text

explainshell/extraction/llm/text.py:196–301

Split text into numbered chunks at section boundaries. Each chunk gets line numbers corresponding to the original text. If the text is small enough, returns a single chunk. A preamble (NAME/SYNOPSIS/DESCRIPTION intro) is prepended to each chunk beyond the first so the model has context. When a single section exceeds the chunk size, it is sub-split at paragraph (blank-line) boundaries.

(text: str)
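To make the described behaviour concrete, here is a minimal, self-contained sketch of section-boundary chunking with a repeated preamble. It is not the explainshell implementation: `sketch_chunk`, the `limit` parameter, the all-caps heading heuristic, the `[Context]` marker, and the use of the first section as the preamble are all assumptions made for illustration, and the paragraph-level sub-splitting is left out.

```python
import re


def number_block(start_line: int, block: str, width: int) -> str:
    """Prefix each line with its right-aligned line number in the original text."""
    return "\n".join(
        f"{start_line + i:>{width}}| {line}"
        for i, line in enumerate(block.split("\n"))
    )


def split_sections(text: str) -> list[tuple[int, str]]:
    """Split at all-caps heading lines (an assumed heuristic).

    Returns (1-based start line, section text) pairs.
    """
    lines = text.split("\n")
    starts = [i for i, ln in enumerate(lines) if re.fullmatch(r"[A-Z][A-Z ]*", ln)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # text before the first heading forms its own section
    bounds = starts + [len(lines)]
    return [(a + 1, "\n".join(lines[a:b])) for a, b in zip(bounds, bounds[1:])]


def sketch_chunk(text: str, limit: int) -> list[str]:
    """Hypothetical stand-in for chunk_text; `limit` plays the role of CHUNK_SIZE_CHARS."""
    width = len(str(text.count("\n") + 1))
    whole = number_block(1, text, width)
    if len(whole) <= limit:
        return [whole]  # small input: a single chunk

    sections = split_sections(text)
    # Assumption: the first section serves as the preamble for later chunks.
    preamble = "[Context]\n" + sections[0][1] + "\n---\n"
    budget = limit - len(preamble)  # leave room for the preamble

    chunks: list[str] = []
    current = ""
    for start, body in sections:
        numbered = number_block(start, body, width)
        candidate = current + "\n" + numbered if current else numbered
        if current and len(candidate) > budget:
            chunks.append(current)  # close the chunk at a section boundary
            current = numbered
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Only chunks after the first receive the preamble.
    return [chunks[0]] + [preamble + c for c in chunks[1:]]
```

With a seven-line input containing the headings NAME, SYNOPSIS, and OPTIONS and a `limit` of 70, this sketch yields three chunks; the second and third begin with the preamble, and the line numbers still refer to the original text (for example, `5| OPTIONS` appears in the third chunk).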


def chunk_text(text: str) -> list[str]:
    """Split text into numbered chunks at section boundaries.

    Each chunk gets line numbers corresponding to the original text.
    If the text is small enough, returns a single chunk.
    A preamble (NAME/SYNOPSIS/DESCRIPTION intro) is prepended to each
    chunk beyond the first so the model has context.

    When a single section exceeds the chunk size, it is sub-split at
    paragraph (blank-line) boundaries.
    """
    numbered_full, _ = number_lines(text)
    if len(numbered_full) <= CHUNK_SIZE_CHARS:
        return [numbered_full]

    sections = _split_sections(text)
    preamble = _build_preamble(text)
    preamble_text = ""
    if preamble:
        preamble_text = (
            "[Context — this is a continuation of the same man page]\n\n"
            + preamble
            + "\n\n---\n\n"
        )
    budget = CHUNK_SIZE_CHARS - len(preamble_text)

    total_lines = text.count("\n") + 1
    width = len(str(total_lines))

    def _number_block(start_line: int, block_text: str) -> str:
        lines = block_text.split("\n")
        numbered = []
        for j, line in enumerate(lines):
            lineno = start_line + j
            numbered.append(f"{lineno:>{width}}| {line}")
        return "\n".join(numbered)

    def _split_by_lines(start_line: int, block_text: str) -> list[tuple[int, str]]:
        """Last-resort split: cut at line boundaries to fit budget."""
        lines = block_text.split("\n")
        result: list[tuple[int, str]] = []
        cur_lines: list[str] = []
        cur_start = start_line
        for line in lines:
            candidate = "\n".join(cur_lines + [line])
            if len(_number_block(cur_start, candidate)) > budget and cur_lines:
                result.append((cur_start, "\n".join(cur_lines)))
                cur_start += len(cur_lines)
                cur_lines = []
            cur_lines.append(line)
        if cur_lines:
            result.append((cur_start, "\n".join(cur_lines)))
        return result

    blocks: list[tuple[int, str]] = []
    for start_line, section_text in sections:
        numbered = _number_block(start_line, section_text)
        if len(numbered) <= budget:
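The last-resort line split in the listing can be tried in isolation. The sketch below lifts the two nested helpers out of `chunk_text` and passes `width` and `budget` as explicit parameters; that signature change is an assumption for illustration (in the source both are closure variables), and the separator is written as a plain `|`.

```python
def number_block(start_line: int, block_text: str, width: int) -> str:
    """Right-align each line number to `width` columns, then '| ' and the line."""
    lines = block_text.split("\n")
    return "\n".join(
        f"{start_line + j:>{width}}| {line}" for j, line in enumerate(lines)
    )


def split_by_lines(
    start_line: int, block_text: str, width: int, budget: int
) -> list[tuple[int, str]]:
    """Greedily pack whole lines into pieces whose numbered form fits `budget`."""
    result: list[tuple[int, str]] = []
    cur_lines: list[str] = []
    cur_start = start_line
    for line in block_text.split("\n"):
        candidate = "\n".join(cur_lines + [line])
        # The `and cur_lines` guard means a piece always holds at least one
        # line, so a single over-long line is emitted whole rather than cut.
        if len(number_block(cur_start, candidate, width)) > budget and cur_lines:
            result.append((cur_start, "\n".join(cur_lines)))
            cur_start += len(cur_lines)
            cur_lines = []
        cur_lines.append(line)
    if cur_lines:
        result.append((cur_start, "\n".join(cur_lines)))
    return result


pieces = split_by_lines(10, "alpha\nbeta\ngamma\ndelta", width=2, budget=20)
# pieces == [(10, "alpha\nbeta"), (12, "gamma\ndelta")]
```

Each piece carries its own starting line number, so numbering a piece later with `number_block` reproduces the line numbers of the original text. The budget is measured on the numbered form, not the raw text, because the number prefix adds characters to every line.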

Callers 8

prepare · Method · 0.90

test_small_text_no_split · Method · 0.90

test_large_text_splits_on_sections · Method · 0.90

test_oversized_section_with_paragraphs_splits · Method · 0.90

test_single_unsplittable_line · Method · 0.90

test_line_numbers_are_globally_correct · Method · 0.90

test_preamble_on_later_chunks · Method · 0.90

test_oversized_preamble_does_not_explode_chunks · Method · 0.90

Calls 6

number_lines · Function · 0.85

_split_sections · Function · 0.85

_build_preamble · Function · 0.85

_number_block · Function · 0.85

_split_by_lines · Function · 0.85

range · Function · 0.85

Tested by 7

test_small_text_no_split · Method · 0.72

test_large_text_splits_on_sections · Method · 0.72

test_oversized_section_with_paragraphs_splits · Method · 0.72

test_single_unsplittable_line · Method · 0.72

test_line_numbers_are_globally_correct · Method · 0.72

test_preamble_on_later_chunks · Method · 0.72

test_oversized_preamble_does_not_explode_chunks · Method · 0.72