hub / github.com/rspeer/python-ftfy / guess_bytes

Function guess_bytes

ftfy/__init__.py:656–724

guess_bytes(bstring: bytes) -> tuple[str, str]

Source from the content-addressed store, hash-verified

def guess_bytes(bstring: bytes) -> tuple[str, str]:
    """
    NOTE: Using `guess_bytes` is not the recommended way of using ftfy. ftfy
    is not designed to be an encoding detector.

    In the unfortunate situation that you have some bytes in an unknown
    encoding, ftfy can guess a reasonable strategy for decoding them, by trying
    a few common encodings that can be distinguished from each other.

    Unlike the rest of ftfy, this may not be accurate, and it may create
    Unicode problems instead of solving them!

    The encodings we try here are:

    - UTF-16 with a byte order mark, because a UTF-16 byte order mark looks
      like nothing else
    - UTF-8, because it's the global standard, which has been used by a
      majority of the Web since 2008
    - "utf-8-variants", or buggy implementations of UTF-8
    - MacRoman, because Microsoft Office thinks it's still a thing, and it
      can be distinguished by its line breaks. (If there are no line breaks in
      the string, though, you're out of luck.)
    - "sloppy-windows-1252", the Latin-1-like encoding that is the most common
      single-byte encoding.
    """
    if isinstance(bstring, str):
        raise UnicodeError(
            "This string was already decoded as Unicode. You should pass "
            "bytes to guess_bytes, not Unicode."
        )

    if bstring.startswith(b"\xfe\xff") or bstring.startswith(b"\xff\xfe"):
        return bstring.decode("utf-16"), "utf-16"

    byteset = set(bstring)
    try:
        if 0xED in byteset or 0xC0 in byteset:
            # Byte 0xed can be used to encode a range of codepoints that
            # are UTF-16 surrogates. UTF-8 does not use UTF-16 surrogates,
            # so when we see 0xed, it's very likely we're being asked to
            # decode CESU-8, the variant that encodes UTF-16 surrogates
            # instead of the original characters themselves.
            #
            # This will occasionally trigger on standard UTF-8, as there
            # are some Korean characters that also use byte 0xed, but that's
            # not harmful because standard UTF-8 characters will decode the
            # same way in our 'utf-8-variants' codec.
            #
            # Byte 0xc0 is impossible because, numerically, it would only
            # encode characters lower than U+0040. Those already have
            # single-byte representations, and UTF-8 requires using the
            # shortest possible representation. However, Java hides the null
            # codepoint, U+0000, in a non-standard longer representation -- it
            # encodes it as 0xc0 0x80 instead of 0x00, guaranteeing that 0x00
            # will never appear in the encoded bytes.
            #
            # The 'utf-8-variants' decoder can handle both of these cases, as
            # well as standard UTF-8, at the cost of a bit of speed.
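
The comments at the end of the listing describe two non-standard byte patterns: CESU-8 surrogate pairs (lead byte 0xED) and Java's two-byte null (0xC0 0x80). As a rough illustration only, the sketch below handles both patterns using nothing but the standard library. `decode_utf8_variant_sketch` is a hypothetical helper written for this page, not ftfy's 'utf-8-variants' codec, and it assumes the input is otherwise well-formed.

```python
def decode_utf8_variant_sketch(data: bytes) -> str:
    """Hypothetical helper: decode UTF-8 that may contain CESU-8 pairs or a Java-style null."""
    # 0xC0 never occurs in standard UTF-8, so every 0xC0 0x80 pair is the
    # overlong null described in the comments; rewrite it as a plain 0x00.
    data = data.replace(b"\xc0\x80", b"\x00")
    # "surrogatepass" lets the UTF-8 decoder accept 0xED-led surrogate bytes,
    # producing lone surrogate code points instead of raising an error.
    text = data.decode("utf-8", "surrogatepass")
    # A round trip through UTF-16 joins each high/low surrogate pair into the
    # single character it stands for. Unpaired surrogates raise an error here.
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


# U+1F600 written CESU-8 style, as two three-byte surrogates.
cesu8_emoji = b"\xed\xa0\xbd\xed\xb8\x80"
# U+D55C, a Hangul syllable whose standard UTF-8 form also starts with 0xED.
korean = "\ud55c".encode("utf-8")

# The strict decoder rejects the surrogate bytes, which is why a separate
# code path is needed at all.
try:
    cesu8_emoji.decode("utf-8")
    strict_accepts_cesu8 = True
except UnicodeDecodeError:
    strict_accepts_cesu8 = False

emoji = decode_utf8_variant_sketch(cesu8_emoji)
with_null = decode_utf8_variant_sketch(b"a\xc0\x80b")
hangul = decode_utf8_variant_sketch(korean)
```

The Korean sample shows why a false trigger on 0xED is harmless: standard UTF-8 passes through the lenient path unchanged.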

Callers 4

test_russian_crash · Function · 0.90

test_guess_bytes · Function · 0.90

test_guess_bytes_null · Function · 0.90

fix_file · Function · 0.85

Calls 1

decode · Method · 0.45
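
The resolved call is presumably `bytes.decode`, which the listing invokes on `bstring`. In the byte-order-mark branch, Python's built-in "utf-16" codec reads the mark to choose endianness and leaves it out of the decoded text, so the function does not have to tell the two marks apart itself. A small illustration, with sample bytes made up for this page:

```python
# Both samples spell "hi" in UTF-16; only the byte order differs.
big_endian = b"\xfe\xff\x00h\x00i"      # BOM FE FF, high byte first
little_endian = b"\xff\xfeh\x00i\x00"   # BOM FF FE, low byte first

# The generic "utf-16" codec consumes the BOM and picks the byte order.
decoded_be = big_endian.decode("utf-16")
decoded_le = little_endian.decode("utf-16")
```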

Tested by 3

test_russian_crash · Function · 0.72

test_guess_bytes · Function · 0.72

test_guess_bytes_null · Function · 0.72