Replaces HTML or XML character references and entities in a text string with the characters they represent. :param text: The HTML (or XML) source text. :return: The plain text, as a Unicode string.
(text)
# Module-level imports this function relies on (Python 3 locations).
import re
from html.entities import name2codepoint


def unescape(text):
    """
    Replaces HTML or XML character references and entities in a text
    string with the characters they represent.

    :param text: The HTML (or XML) source text.
    :return: The plain text, as a Unicode string.
    """

    def fixup(m):
        text = m.group(0)  # the full reference, e.g. "&#x41;" or "&amp;"
        code = m.group(1)  # the part between "&" (or "&#") and ";"
        try:
            if text[1] == "#":  # character reference
                if text[2] == "x":  # hexadecimal, e.g. "&#x41;"
                    return chr(int(code[1:], 16))
                else:  # decimal, e.g. "&#65;"
                    return chr(int(code))
            else:  # named entity
                return chr(name2codepoint[code])
        except (KeyError, ValueError, OverflowError):
            return text  # unknown or malformed reference: leave as is

    # Raw string so that "\w" is not treated as a string escape.
    return re.sub(r"&#?(\w+);", fixup, text)


# Match HTML comments
no outgoing calls
no test coverage detected
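Since no tests are recorded for `unescape`, the following is a minimal sketch of the behaviour one would expect from it. It assumes Python 3, with `name2codepoint` imported from `html.entities`. The function is repeated so the block runs on its own, and `check_unescape` is a hypothetical helper name that is not part of the original module.

```python
import re
from html.entities import name2codepoint  # assumption: Python 3 module layout


def unescape(text):
    """Copy of the function shown above, repeated so this sketch is self-contained."""

    def fixup(m):
        ref = m.group(0)   # full match, e.g. "&#x41;" or "&amp;"
        code = m.group(1)  # text between "&" (or "&#") and ";"
        try:
            if ref[1] == "#":  # numeric character reference
                if ref[2] == "x":  # hexadecimal form
                    return chr(int(code[1:], 16))
                return chr(int(code))  # decimal form
            return chr(name2codepoint[code])  # named entity
        except (KeyError, ValueError, OverflowError):
            return ref  # unknown or malformed reference: leave as is

    return re.sub(r"&#?(\w+);", fixup, text)


def check_unescape():
    """Hypothetical smoke test covering each branch of fixup."""
    # Named entities are looked up in name2codepoint.
    assert unescape("&lt;b&gt; &amp; &quot;") == '<b> & "'
    # Decimal and lowercase-hex character references are converted.
    assert unescape("&#65;&#x42;") == "AB"
    # Unknown names and malformed numbers are left untouched.
    assert unescape("&nosuchentity;") == "&nosuchentity;"
    assert unescape("&#xZZ;") == "&#xZZ;"
    # An uppercase "X" prefix is not recognised by this implementation.
    assert unescape("&#X41;") == "&#X41;"
    # Substitution is a single pass, so doubly escaped text is unescaped once.
    assert unescape("&amp;lt;") == "&lt;"
    return True


if __name__ == "__main__":
    check_unescape()
```

The cases for `&#X41;` and `&amp;lt;` document limitations of this particular implementation rather than desired behaviour. Where those matter, the standard library's `html.unescape` (Python 3.4 and later) may be a better fit.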