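The decoding order that the `guess_bytes` docstring describes can be sketched with standard-library codecs alone. This is a rough illustration, not ftfy's implementation: `sketch_guess_bytes` is a hypothetical name, the stdlib `mac_roman` and `cp1252` codecs stand in for ftfy's own, the "utf-8-variants" step is skipped because the standard library has no equivalent, and the reading of "distinguished by its line breaks" as "carriage returns without line feeds" is an assumption.

```python
import codecs


def sketch_guess_bytes(data: bytes) -> tuple[str, str]:
    """Hypothetical stand-in for guess_bytes, using only stdlib codecs."""
    # 1. A UTF-16 byte order mark is unambiguous, so check for it first.
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return data.decode("utf-16"), "utf-16"

    # 2. Strict UTF-8 rejects most text written in single-byte encodings.
    #    (ftfy also tries its "utf-8-variants" codec around this point;
    #    this sketch omits that step.)
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass

    # 3. Assumption: MacRoman text is recognized by CR with no LF,
    #    the line ending used by classic Mac OS.
    if b"\r" in data and b"\n" not in data:
        return data.decode("mac_roman"), "mac_roman"

    # 4. Fall back to Windows-1252. The stdlib codec leaves five bytes
    #    undefined, so replace them rather than raise.
    return data.decode("cp1252", errors="replace"), "cp1252"


if __name__ == "__main__":
    for sample in (b"\xff\xfeh\x00i\x00", b"caf\x8e\rbar", b"caf\xe9"):
        print(sample, "->", sketch_guess_bytes(sample))
```

As in the original, the second element of the returned tuple names the encoding that was chosen, so a caller can log or reject a guess it does not trust.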
| 654 | |
| 655 | |
| 656 | def guess_bytes(bstring: bytes) -> tuple[str, str]: |
| 657 | """ |
| 658 | NOTE: Using `guess_bytes` is not the recommended way of using ftfy. ftfy |
| 659 | is not designed to be an encoding detector. |
| 660 | |
| 661 | In the unfortunate situation that you have some bytes in an unknown |
| 662 | encoding, ftfy can guess a reasonable strategy for decoding them, by trying |
| 663 | a few common encodings that can be distinguished from each other. |
| 664 | |
| 665 | Unlike the rest of ftfy, this may not be accurate, and it may *create* |
| 666 | Unicode problems instead of solving them! |
| 667 | |
| 668 | The encodings we try here are: |
| 669 | |
| 670 | - UTF-16 with a byte order mark, because a UTF-16 byte order mark looks |
| 671 | like nothing else |
| 672 | - UTF-8, because it's the global standard, which has been used by a |
| 673 | majority of the Web since 2008 |
| 674 | - "utf-8-variants", for buggy implementations of UTF-8 such as CESU-8 |
| 675 | - MacRoman, because Microsoft Office thinks it's still a thing, and it |
| 676 | can be distinguished by its CR-only line breaks. (If there are no line |
| 677 | breaks in the string, though, you're out of luck.) |
| 678 | - "sloppy-windows-1252", the Latin-1-like encoding that is the most common |
| 679 | single-byte encoding. |
| 680 | """ |
| 681 | if isinstance(bstring, str): |
| 682 | raise UnicodeError( |
| 683 | "This string was already decoded as Unicode. You should pass " |
| 684 | "bytes to guess_bytes, not Unicode." |
| 685 | ) |
| 686 | |
| 687 | if bstring.startswith(b"\xfe\xff") or bstring.startswith(b"\xff\xfe"): |
| 688 | return bstring.decode("utf-16"), "utf-16" |
| 689 | |
| 690 | byteset = set(bstring) |
| 691 | try: |
| 692 | if 0xED in byteset or 0xC0 in byteset: |
| 693 | # Byte 0xed can be used to encode a range of codepoints that |
| 694 | # are UTF-16 surrogates. UTF-8 does not use UTF-16 surrogates, |
| 695 | # so when we see 0xed, it's very likely we're being asked to |
| 696 | # decode CESU-8, the variant that encodes UTF-16 surrogates |
| 697 | # instead of the original characters themselves. |
| 698 | # |
| 699 | # This will occasionally trigger on standard UTF-8, as there |
| 700 | # are some Korean characters that also use byte 0xed, but that's |
| 701 | # not harmful because standard UTF-8 characters will decode the |
| 702 | # same way in our 'utf-8-variants' codec. |
| 703 | # |
| 704 | # Byte 0xc0 is impossible in standard UTF-8 because, numerically, it |
| 705 | # would only encode characters lower than U+0040. Those already have |
| 706 | # single-byte representations, and UTF-8 requires using the |
| 707 | # shortest possible representation. However, Java hides the null |
| 708 | # codepoint, U+0000, in a non-standard longer representation -- it |
| 709 | # encodes it as 0xc0 0x80 instead of 0x00, guaranteeing that 0x00 |
| 710 | # will never appear in the encoded bytes. |
| 711 | # |
| 712 | # The 'utf-8-variants' decoder can handle both of these cases, as |
| 713 | # well as standard UTF-8, at the cost of a bit of speed. |