| 440 | |
| 441 | |
| 442 | def replace_lossy_sequences(byts: bytes) -> bytes: |
| 443 | """ |
| 444 | This function identifies sequences where information has been lost in |
| 445 | a "sloppy" codec, indicated by byte 1A, and if they would otherwise look |
| 446 | like a UTF-8 sequence, it replaces them with the UTF-8 sequence for U+FFFD. |
| 447 | |
| 448 | A further explanation: |
| 449 | |
| 450 | ftfy can now fix text in a few cases that it would previously fix |
| 451 | incompletely, because it couldn't successfully apply the fix to the |
| 452 | entire string. A very common case of this is when characters have |
| 453 | been erroneously decoded as windows-1252, but instead of the "sloppy" |
| 454 | windows-1252 that passes through unassigned bytes, the unassigned bytes get |
| 455 | turned into U+FFFD (�), so we can't tell what they were. |
| 456 | |
| 457 | This most commonly happens with curly quotation marks that appear |
| 458 | ``â€œ like this â€�``. |
| 459 | |
| 460 | We can do better by building on ftfy's "sloppy codecs" to let them handle |
| 461 | less-sloppy but more-lossy text. When they encounter the character ``�``, |
| 462 | instead of refusing to encode it, they encode it as byte 1A -- an |
| 463 | ASCII control code called SUBSTITUTE that once was meant for about the same |
| 464 | purpose. We can then apply a fixer that looks for UTF-8 sequences where |
| 465 | some continuation bytes have been replaced by byte 1A, and decode the whole |
| 466 | sequence as �; if that doesn't work, it'll just turn the byte back into � |
| 467 | itself. |
| 468 | |
| 469 | As a result, the above text ``â€œ like this â€�`` will decode as |
| 470 | ``“ like this �``. |
| 471 | |
| 472 | If U+1A was actually in the original string, then the sloppy codecs will |
| 473 | not be used, and this function will not be run, so your weird control |
| 474 | character will be left alone but wacky fixes like this won't be possible. |
| 475 | |
| 476 | This is used as a transcoder within `fix_encoding`. |
| 477 | """ |
| 478 | return LOSSY_UTF8_RE.sub("\ufffd".encode(), byts) |
| 479 | |
| 480 | |
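The round trip described in the `replace_lossy_sequences` docstring can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not ftfy's implementation: `LOSSY_SKETCH_RE` is a simplified, hypothetical stand-in for `LOSSY_UTF8_RE`, `sloppy_encode_cp1252` is a hypothetical helper that imitates a sloppy codec by mapping U+FFFD to byte 1A, and Python's built-in `cp1252` codec with `errors="replace"` plays the role of the lossy decoder that caused the damage.

```python
import re

# Hypothetical, simplified stand-in for ftfy's LOSSY_UTF8_RE (an assumption,
# not the library's actual pattern). It matches a UTF-8 lead byte whose
# continuation positions hold at least one SUBSTITUTE byte (0x1A), or a
# lone 0x1A that is not part of such a sequence.
LOSSY_SKETCH_RE = re.compile(
    b"[\xc2-\xdf]\x1a"
    b"|[\xe0-\xef](?:\x1a[\x80-\xbf\x1a]|[\x80-\xbf]\x1a)"
    b"|[\xf0-\xf4](?:\x1a[\x80-\xbf\x1a]{2}"
    b"|[\x80-\xbf]\x1a[\x80-\xbf\x1a]"
    b"|[\x80-\xbf]{2}\x1a)"
    b"|\x1a"
)


def sloppy_encode_cp1252(text: str) -> bytes:
    """Hypothetical helper: encode as windows-1252, turning U+FFFD into 0x1A."""
    return text.replace("\ufffd", "\x1a").encode("cp1252")


def replace_lossy_sketch(byts: bytes) -> bytes:
    """Replace each damaged sequence with the UTF-8 encoding of U+FFFD."""
    return LOSSY_SKETCH_RE.sub("\ufffd".encode("utf-8"), byts)


# Curly quotes, encoded as UTF-8, then wrongly decoded as windows-1252.
# Byte 0x9D is unassigned in windows-1252, so it is lost to U+FFFD.
original = "\u201c like this \u201d"
mojibake = original.encode("utf-8").decode("cp1252", errors="replace")

# Re-encode with 0x1A marking the lost byte, collapse the damaged sequence,
# then decode the result as UTF-8.
repaired = replace_lossy_sketch(sloppy_encode_cp1252(mojibake)).decode("utf-8")
print(repaired)
```

The opening quotation mark survives because all three of its bytes are intact, while the closing one becomes a single U+FFFD because its last continuation byte was lost. Unlike ftfy, this sketch does not check whether a genuine U+001A was present in the input, so it would overwrite such a character.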
| 481 | def decode_inconsistent_utf8(text: str) -> str: |