Decode HTML entities and character references, including some nonstandard ones written in all-caps. Python has a built-in called `html.unescape` that can decode HTML escapes, including a bunch of messy edge cases, such as decoding escapes without semicolons, like "&amp".
(text: str)
| 92 | |
| 93 | |
| 94 | def unescape_html(text: str) -> str: |
| 95 | """ |
| 96 | Decode HTML entities and character references, including some nonstandard |
| 97 | ones written in all-caps. |
| 98 | |
| 99 | Python has a built-in called `html.unescape` that can decode HTML escapes, |
| 100 | including a bunch of messy edge cases, such as decoding escapes without |
| 101 | semicolons, like "&amp". |
| 102 | |
| 103 | If you know you've got HTML-escaped text, applying `html.unescape` is the |
| 104 | right way to convert it to plain text. But in ambiguous situations, that |
| 105 | would create false positives. For example, the informally written text |
| 106 | "this&not that" should not automatically be decoded as "this¬ that". |
| 107 | |
| 108 | In this function, we decode the escape sequences that appear in the |
| 109 | `html.entities.html5` dictionary, as long as they are the unambiguous ones |
| 110 | that end in semicolons. |
| 111 | |
| 112 | We also decode all-caps versions of Latin letters and common symbols. |
| 113 | If a database contains the name 'P&EACUTE;REZ', we can read that and intuit |
| 114 | that it was supposed to say 'PÉREZ'. This is limited to a smaller set of |
| 115 | entities, because there are many instances where entity names are |
| 116 | case-sensitive in complicated ways. |
| 117 | |
| 118 | >>> unescape_html('&lt;tag&gt;') |
| 119 | '<tag>' |
| 120 | |
| 121 | >>> unescape_html('&Jscr;ohn &HilbertSpace;ancock') |
| 122 | '𝒥ohn ℋancock' |
| 123 | |
| 124 | >>> unescape_html('&checkmark;') |
| 125 | '✓' |
| 126 | |
| 127 | >>> unescape_html('P&eacute;rez') |
| 128 | 'Pérez' |
| 129 | |
| 130 | >>> unescape_html('P&EACUTE;REZ') |
| 131 | 'PÉREZ' |
| 132 | |
| 133 | >>> unescape_html('BUNDESSTRA&SZLIG;E') |
| 134 | 'BUNDESSTRASSE' |
| 135 | |
| 136 | >>> unescape_html('&ntilde; &Ntilde; &NTILDE; &nTILDE;') |
| 137 | 'ñ Ñ Ñ &nTILDE;' |
| 138 | """ |
| 139 | return HTML_ENTITY_RE.sub(_unescape_fixup, text) |
| 140 | |
| 141 | |
| 142 | ANSI_RE = re.compile("\033\\[((?:\\d|;)*)([a-zA-Z])") |
no outgoing calls
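
The listing above does not show `HTML_ENTITY_RE` or `_unescape_fixup`, so the following is only a minimal sketch of the behaviour the docstring describes, not the library's actual implementation. The names `NAMED_ENTITY_RE`, `_decode_named`, and `strict_unescape` are hypothetical, numeric character references are left out for brevity, and the rule that restricts the all-caps fallback to single Latin letters is an assumption standing in for the "smaller set of entities" the docstring mentions.

```python
import re
from html.entities import html5

# Hypothetical pattern: match only named entities that end in a semicolon,
# such as "&eacute;". A bare "&not" followed by a space never matches.
NAMED_ENTITY_RE = re.compile(r"&([0-9A-Za-z]{1,31});")


def _decode_named(match: re.Match) -> str:
    """Decode one semicolon-terminated entity, or return it unchanged."""
    name = match.group(1)
    key = name + ";"  # the unambiguous html5 keys include the trailing ";"
    if key in html5:
        return html5[key]

    # Assumed all-caps fallback: "&EACUTE;" -> "eacute;" -> "é" -> "É".
    # Limited here to single Latin letters, because entity names are
    # case-sensitive in ways that make a general rule unsafe.
    if name.isupper():
        lower = html5.get(name.lower() + ";")
        if (
            lower is not None
            and len(lower) == 1
            and lower.isalpha()
            and ord(lower) < 0x250
        ):
            return lower.upper()

    # Anything unrecognized stays exactly as written.
    return match.group(0)


def strict_unescape(text: str) -> str:
    """Decode only the entities that are unambiguous in running text."""
    return NAMED_ENTITY_RE.sub(_decode_named, text)
```

With this sketch, `strict_unescape("this&not that")` returns the text unchanged, whereas `html.unescape` would produce "this¬ that". Uppercasing the decoded lowercase letter is also what turns `&SZLIG;` into "SS", because `"ß".upper()` is "SS" in Python.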