hub / github.com/idank/explainshell / clean_mandoc_artifacts

Function clean_mandoc_artifacts

explainshell/extraction/llm/text.py:88–111

Normalize the HTML entities that `mandoc -T markdown` emits. `&nbsp;` becomes a plain space; its non-breaking semantics aren't needed downstream and ASCII whitespace is friendlier to the LLM. `&zwnj;` is preserved. mandoc inserts it between abutting bold/italic emphasis spans (`**foo**&zwnj;…

(text: str)


def clean_mandoc_artifacts(text: str) -> str:
    """Normalize HTML entities mandoc -T markdown emits.

    `&nbsp;` becomes a plain space; its non-breaking semantics aren't needed
    downstream and ASCII whitespace is friendlier to the LLM.

    `&zwnj;` is preserved. mandoc inserts it between abutting bold/italic
    emphasis spans (`**foo**&zwnj;*bar*`) so CommonMark's delimiter-run
    flanking rules see the `;` (ASCII punctuation) and parse the runs as
    separate spans. Stripping it produced patterns like
    `**--config-file=*****file*` in stored option text, which CommonMark
    folds into garbled emphasis.

    The choice of `&zwnj;` is unusual industry-wide: Pandoc and Pod::Markdown
    take different routes (Pandoc emits no separator; Pod::Markdown uses `_`
    for italic to avoid collisions). Both alternatives fail for our corpus:
    Pandoc's "trust CommonMark rule-of-3" misparses the
    `**flag**+arg-name` pattern that mandoc-emitted man pages produce
    constantly; Pod::Markdown's `_` italic breaks intraword italic
    (`_N_th` → literal underscores) which is rampant in real man pages
    (ffmpeg.1, git.1, tar.1, ...). mandoc's `&zwnj;` happens to dodge both
    pitfalls for our specific consumer (cmarkgfm + LLM).
    """
    return text.replace("&nbsp;", " ")
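A minimal sketch of the behavior described in the docstring, assuming the function body is the single `replace` call on the `&nbsp;` entity. The function is redefined locally so the snippet runs on its own, and the sample string is invented for illustration, not taken from a real man page.

```python
def clean_mandoc_artifacts(text: str) -> str:
    # Local copy of the one-line body: only &nbsp; is rewritten.
    return text.replace("&nbsp;", " ")


# Hypothetical mandoc-style markdown containing both entities.
sample = "**-o**&nbsp;*file*, **--config-file=**&zwnj;*file*"

cleaned = clean_mandoc_artifacts(sample)

# &nbsp; is now an ASCII space; &zwnj; is left untouched so the
# adjacent bold and italic runs stay separated.
print(cleaned)
```

Because the function touches only `&nbsp;`, any other entity in the input, including `&zwnj;`, passes through unchanged.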

Callers 4

prepare · Method · 0.90
test_no_nbsp_entities · Method · 0.90
_filtered_metrics · Function · 0.90

Calls

no outgoing calls

Tested by 2

test_no_nbsp_entities · Method · 0.72