Normalize HTML entities mandoc -T markdown emits. `&nbsp;` becomes a plain space: its non-breaking semantics aren't needed downstream and ASCII whitespace is friendlier to the LLM. `&zwnj;` is preserved. mandoc inserts it between abutting bold/italic emphasis spans (`**foo**&zwnj;
(text: str)
| 86 | |
| 87 | |
| 88 | def clean_mandoc_artifacts(text: str) -> str: |
| 89 | """Normalize the HTML entities that `mandoc -T markdown` emits. |
| 90 | |
| 91 | `&nbsp;` becomes a plain space: its non-breaking semantics aren't needed |
| 92 | downstream and ASCII whitespace is friendlier to the LLM. |
| 93 | |
| 94 | `&zwnj;` is preserved. mandoc inserts it between abutting bold/italic |
| 95 | emphasis spans (`**foo**&zwnj;*bar*`) so CommonMark's delimiter-run |
| 96 | flanking rules see the `;` (ASCII punctuation) and parse the runs as |
| 97 | separate spans. Stripping it produced patterns like |
| 98 | `**--config-file=*****file*` in stored option text, which CommonMark |
| 99 | folds into garbled emphasis. |
| 100 | |
| 101 | The choice of `&zwnj;` is unusual industry-wide: Pandoc and Pod::Markdown |
| 102 | take different routes (Pandoc emits no separator; Pod::Markdown uses `_` |
| 103 | for italic to avoid collisions). Both alternatives fail for our corpus: |
| 104 | Pandoc's "trust CommonMark rule-of-3" misparses the |
| 105 | `**flag**+arg-name` pattern that mandoc-emitted man pages produce |
| 106 | constantly; Pod::Markdown's `_` italic breaks intraword italic |
| 107 | (`_N_th` → literal underscores) which is rampant in real man pages |
| 108 | (ffmpeg.1, git.1, tar.1, ...). mandoc's `&zwnj;` happens to dodge both |
| 109 | pitfalls for our specific consumer (cmarkgfm + LLM). |
| 110 | """ |
| 111 | return text.replace("&nbsp;", " ") |
| 112 | |
| 113 | |
| 114 | def filter_sections(text: str) -> tuple[str, dict[str, int]]: |
no outgoing calls
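
The behaviour described in the docstring can be sketched with plain string operations. This is a minimal illustration, assuming the entities arrive as the literal text `&nbsp;` and `&zwnj;`. The `strip_zwnj` helper and the sample string are hypothetical and exist only to show the collision the docstring warns about; they are not part of the source.

```python
def clean_mandoc_artifacts(text: str) -> str:
    """Replace the `&nbsp;` entity with a plain space; leave `&zwnj;` untouched."""
    return text.replace("&nbsp;", " ")


def strip_zwnj(text: str) -> str:
    """Hypothetical helper (not in the source): the stripping the docstring advises against."""
    return text.replace("&zwnj;", "")


# Hypothetical sample: a bold span abutting an italic span, then a non-breaking space.
sample = "**foo**&zwnj;*bar*&nbsp;baz"

kept = clean_mandoc_artifacts(sample)
stripped = strip_zwnj(kept)

print(kept)      # **foo**&zwnj;*bar* baz
print(stripped)  # **foo***bar* baz
```

With the separator kept, the closing `**` and the opening `*` stay apart. Once it is stripped, they merge into a single `***` run, which is the kind of ambiguous delimiter sequence the docstring says CommonMark folds into garbled emphasis.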