MCPcopy
hub / github.com/rspeer/python-ftfy / replace_lossy_sequences

Function replace_lossy_sequences

ftfy/fixes.py:442–478  ·  view source on GitHub ↗

This function identifies sequences where information has been lost in a "sloppy" codec, indicated by byte 1A, and if they would otherwise look like a UTF-8 sequence, it replaces them with the UTF-8 sequence for U+FFFD. A further explanation: ftfy can now fix text in a few case

(byts: bytes)

Source from the content-addressed store, hash-verified

440
441
442def replace_lossy_sequences(byts: bytes) -> bytes:
443 """
444 This function identifies sequences where information has been lost in
445 a "sloppy" codec, indicated by byte 1A, and if they would otherwise look
446 like a UTF-8 sequence, it replaces them with the UTF-8 sequence for U+FFFD.
447
448 A further explanation:
449
450 ftfy can now fix text in a few cases that it would previously fix
451 incompletely, because of the fact that it can't successfully apply the fix
452 to the entire string. A very common case of this is when characters have
453 been erroneously decoded as windows-1252, but instead of the "sloppy"
454 windows-1252 that passes through unassigned bytes, the unassigned bytes get
455 turned into U+FFFD (�), so we can't tell what they were.
456
457 This most commonly happens with curly quotation marks that appear
458 ``“ like this �``.
459
460 We can do better by building on ftfy's "sloppy codecs" to let them handle
461 less-sloppy but more-lossy text. When they encounter the character ``�``,
462 instead of refusing to encode it, they encode it as byte 1A -- an
463 ASCII control code called SUBSTITUTE that once was meant for about the same
464 purpose. We can then apply a fixer that looks for UTF-8 sequences where
465 some continuation bytes have been replaced by byte 1A, and decode the whole
466 sequence as �; if that doesn't work, it'll just turn the byte back into �
467 itself.
468
469 As a result, the above text ``“ like this �`` will decode as
470 ``“ like this �``.
471
472 If U+1A was actually in the original string, then the sloppy codecs will
473 not be used, and this function will not be run, so your weird control
474 character will be left alone but wacky fixes like this won't be possible.
475
476 This is used as a transcoder within `fix_encoding`.
477 """
478 return LOSSY_UTF8_RE.sub("\ufffd".encode(), byts)
479
480
481def decode_inconsistent_utf8(text: str) -> str:

Callers

nothing calls this directly

Calls 1

encodeMethod · 0.45

Tested by

no test coverage detected