def fix_text(text: str, config: TextFixerConfig | None = None, **kwargs: Any) -> str:
    r"""
    Given Unicode text as input, fix inconsistencies and glitches in it,
    such as mojibake (text that was decoded in the wrong encoding).

    Let's start with some examples:

        >>> fix_text('✔ No problems')
        '✔ No problems'

        >>> print(fix_text("&macr;\\_(ã\x83\x84)_/&macr;"))
        ¯\_(ツ)_/¯

        >>> fix_text('Broken text&hellip; it&#x2019;s flubberific!')
        "Broken text… it's flubberific!"

        >>> fix_text('ＬＯＵＤ　ＮＯＩＳＥＳ')
        'LOUD NOISES'

    ftfy applies a number of different fixes to the text, and can accept
    configuration to select which fixes to apply.

    The configuration takes the form of a :class:`TextFixerConfig` object,
    and you can see a description of the options in that class's docstring
    or in the full documentation at ftfy.readthedocs.org.

    For convenience and backward compatibility, the configuration can also
    take the form of keyword arguments, which will set the equivalently-named
    fields of the TextFixerConfig object.

    For example, here are two ways to fix text but skip the "uncurl_quotes"
    step::

        fix_text(text, TextFixerConfig(uncurl_quotes=False))
        fix_text(text, uncurl_quotes=False)

    This function fixes text in independent segments, which are usually lines
    of text, or arbitrarily broken up every 1 million codepoints (configurable
    with `config.max_decode_length`) if there aren't enough line breaks. The
    bound on segment lengths helps to avoid unbounded slowdowns.

    ftfy can also provide an 'explanation', a list of transformations it applied
    to the text that would fix more text like it. This function doesn't provide
    explanations (because there may be different fixes for different segments
    of text).

    To get an explanation, use the :func:`fix_and_explain()` function, which
    fixes the string in one segment and explains what it fixed.
    """

    if config is None:
        config = TextFixerConfig(explain=False)
    config = _config_from_kwargs(config, kwargs)
    if isinstance(text, bytes):
        raise UnicodeError(BYTES_ERROR_TEXT)

    out = []
    pos = 0
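
The docstring above says the text is fixed in independent segments: usually one line each, with over-long runs cut at a maximum length. The following is a minimal, standard-library-only sketch of that splitting strategy, not ftfy's own implementation. The names `iter_segments`, `fix_in_segments`, and `fix_segment` are hypothetical and introduced only for illustration; `fix_segment` stands in for whatever per-segment fixer is applied.

```python
from typing import Callable, Iterator


def iter_segments(text: str, max_length: int) -> Iterator[str]:
    """Yield consecutive pieces of `text`, one line each, capped at `max_length`."""
    if max_length < 1:
        raise ValueError("max_length must be at least 1")
    pos = 0
    while pos < len(text):
        newline = text.find("\n", pos)
        # Keep the newline with its line; with no newline left, take the rest.
        end = len(text) if newline == -1 else newline + 1
        # Cap over-long segments so the work done per segment stays bounded.
        if end - pos > max_length:
            end = pos + max_length
        yield text[pos:end]
        pos = end


def fix_in_segments(
    text: str,
    fix_segment: Callable[[str], str],  # hypothetical per-segment fixer
    max_length: int = 1_000_000,
) -> str:
    """Apply `fix_segment` to each segment independently and rejoin the results."""
    return "".join(fix_segment(segment) for segment in iter_segments(text, max_length))


if __name__ == "__main__":
    print(list(iter_segments("ab\ncdefg\nh", max_length=3)))
    # ['ab\n', 'cde', 'fg\n', 'h']
```

Because every segment is processed on its own, a fix chosen for one line cannot influence another, which is consistent with the docstring's remark that a single explanation may not apply to the whole text.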