hub / github.com/rspeer/python-ftfy / guess_bytes

Function guess_bytes

ftfy/__init__.py:656–724


(bstring: bytes)

Source from the content-addressed store, hash-verified

def guess_bytes(bstring: bytes) -> tuple[str, str]:
    """
    NOTE: Using `guess_bytes` is not the recommended way of using ftfy. ftfy
    is not designed to be an encoding detector.

    In the unfortunate situation that you have some bytes in an unknown
    encoding, ftfy can guess a reasonable strategy for decoding them, by trying
    a few common encodings that can be distinguished from each other.

    Unlike the rest of ftfy, this may not be accurate, and it may *create*
    Unicode problems instead of solving them!

    The encodings we try here are:

    - UTF-16 with a byte order mark, because a UTF-16 byte order mark looks
      like nothing else
    - UTF-8, because it's the global standard, which has been used by a
      majority of the Web since 2008
    - "utf-8-variants", or buggy implementations of UTF-8
    - MacRoman, because Microsoft Office thinks it's still a thing, and it
      can be distinguished by its line breaks. (If there are no line breaks in
      the string, though, you're out of luck.)
    - "sloppy-windows-1252", the Latin-1-like encoding that is the most common
      single-byte encoding.
    """
    if isinstance(bstring, str):
        raise UnicodeError(
            "This string was already decoded as Unicode. You should pass "
            "bytes to guess_bytes, not Unicode."
        )

    if bstring.startswith(b"\xfe\xff") or bstring.startswith(b"\xff\xfe"):
        return bstring.decode("utf-16"), "utf-16"

    byteset = set(bstring)
    try:
        if 0xED in byteset or 0xC0 in byteset:
            # Byte 0xed can be used to encode a range of codepoints that
            # are UTF-16 surrogates. UTF-8 does not use UTF-16 surrogates,
            # so when we see 0xed, it's very likely we're being asked to
            # decode CESU-8, the variant that encodes UTF-16 surrogates
            # instead of the original characters themselves.
            #
            # This will occasionally trigger on standard UTF-8, as there
            # are some Korean characters that also use byte 0xed, but that's
            # not harmful because standard UTF-8 characters will decode the
            # same way in our 'utf-8-variants' codec.
            #
            # Byte 0xc0 is impossible because, numerically, it would only
            # encode characters lower than U+0040. Those already have
            # single-byte representations, and UTF-8 requires using the
            # shortest possible representation. However, Java hides the null
            # codepoint, U+0000, in a non-standard longer representation -- it
            # encodes it as 0xc0 0x80 instead of 0x00, guaranteeing that 0x00
            # will never appear in the encoded bytes.
            #
            # The 'utf-8-variants' decoder can handle both of these cases, as
            # well as standard UTF-8, at the cost of a bit of speed.
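
The listing above stops inside the `try` block, so the rest of the function is not shown here. As a rough illustration of the cascade the docstring describes, the following is a minimal standard-library sketch, not ftfy's implementation. The name `guess_bytes_sketch` is hypothetical. Python's built-in `cp1252` codec with `errors="replace"` stands in for ftfy's "sloppy-windows-1252", the "utf-8-variants" step is omitted entirely, and the "carriage return without line feed" test for MacRoman is an assumption based on the docstring's remark about line breaks.

```python
import codecs


def guess_bytes_sketch(data: bytes) -> tuple[str, str]:
    """Stdlib-only approximation of the cascade in the docstring above.

    Hypothetical helper: it does not use ftfy's codecs and will not
    decode CESU-8 or Java's 0xC0 0x80 null the way 'utf-8-variants' does.
    """
    if isinstance(data, str):
        raise UnicodeError("expected bytes, got an already-decoded str")

    # 1. A UTF-16 byte order mark is unambiguous, so check it first.
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return data.decode("utf-16"), "utf-16"

    # 2. Strict UTF-8. Python's decoder rejects encoded surrogates
    #    (lead byte 0xED) and overlong forms (lead byte 0xC0).
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass

    # 3. Assumed heuristic: classic Mac OS text uses CR alone as its
    #    line break, so CR without LF hints at MacRoman.
    byteset = set(data)
    if 0x0D in byteset and 0x0A not in byteset:
        return data.decode("mac_roman"), "mac_roman"

    # 4. Last resort: a Windows-1252 decode. errors="replace" is a
    #    stand-in for the "sloppy" codec, which ftfy provides itself.
    return data.decode("cp1252", errors="replace"), "cp1252"


if __name__ == "__main__":
    for sample in (
        "hi".encode("utf-16"),   # starts with a byte order mark
        "naïve".encode("utf-8"),
        b"caf\x8e\rbar",         # 0x8E is 'é' in MacRoman, CR-only line break
        b"caf\xe9",              # 0xE9 is 'é' in Windows-1252
        b"\xc0\x80",             # Java-style null: not valid strict UTF-8
    ):
        print(sample, "->", guess_bytes_sketch(sample))
```

Because the sketch skips the variant decoder, the Java-style null `b"\xc0\x80"` falls all the way through to the single-byte fallback and comes out as two unrelated characters. That is the kind of mis-decoding the `0xED`/`0xC0` check in the real function is there to avoid.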

Callers 4

test_russian_crash · Function · 0.90
test_guess_bytes · Function · 0.90
test_guess_bytes_null · Function · 0.90
fix_file · Function · 0.85

Calls 1

decode · Method · 0.45

Tested by 3

test_russian_crash · Function · 0.72
test_guess_bytes · Function · 0.72
test_guess_bytes_null · Function · 0.72