hub / github.com/rspeer/python-ftfy / replace_lossy_sequences

Function replace_lossy_sequences

ftfy/fixes.py:442–478 · view source on GitHub ↗

This function identifies sequences where information has been lost in a "sloppy" codec, indicated by byte 1A, and if they would otherwise look like a UTF-8 sequence, it replaces them with the UTF-8 sequence for U+FFFD. A further explanation: ftfy can now fix text in a few case

(byts: bytes)

Source from the content-addressed store, hash-verified

440
441
442	def replace_lossy_sequences(byts: bytes) -> bytes:
443	"""
444	This function identifies sequences where information has been lost in
445	a "sloppy" codec, indicated by byte 1A, and if they would otherwise look
446	like a UTF-8 sequence, it replaces them with the UTF-8 sequence for U+FFFD.
447
448	A further explanation:
449
450	ftfy can now fix text in a few cases that it would previously fix
451	incompletely, because of the fact that it can't successfully apply the fix
452	to the entire string. A very common case of this is when characters have
453	been erroneously decoded as windows-1252, but instead of the "sloppy"
454	windows-1252 that passes through unassigned bytes, the unassigned bytes get
455	turned into U+FFFD (�), so we can't tell what they were.
456
457	This most commonly happens with curly quotation marks that appear
458	``â€œ like this â€�``.
459
460	We can do better by building on ftfy's "sloppy codecs" to let them handle
461	less-sloppy but more-lossy text. When they encounter the character ``�``,
462	instead of refusing to encode it, they encode it as byte 1A -- an
463	ASCII control code called SUBSTITUTE that once was meant for about the same
464	purpose. We can then apply a fixer that looks for UTF-8 sequences where
465	some continuation bytes have been replaced by byte 1A, and decode the whole
466	sequence as �; if that doesn't work, it'll just turn the byte back into �
467	itself.
468
469	As a result, the above text ``â€œ like this â€�`` will decode as
470	``“ like this �``.
471
472	If U+1A was actually in the original string, then the sloppy codecs will
473	not be used, and this function will not be run, so your weird control
474	character will be left alone but wacky fixes like this won't be possible.
475
476	This is used as a transcoder within `fix_encoding`.
477	"""
478	return LOSSY_UTF8_RE.sub("\ufffd".encode(), byts)
479
480
481	def decode_inconsistent_utf8(text: str) -> str:

Callers

nothing calls this directly

Calls 1

encodeMethod · 0.45

Tested by

no test coverage detected