hub / github.com/rspeer/python-ftfy / unescape_html

Function unescape_html

ftfy/fixes.py:94–139

Decode HTML entities and character references, including some nonstandard ones written in all-caps. Python has a built-in called `html.unescape` that can decode HTML escapes, including a bunch of messy edge cases such as decoding escapes without semicolons such as "&amp".
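As a small illustration of the leniency described above, the following snippet calls the standard-library `html.unescape` directly. The variable names are only for illustration; the inputs include the "this&not that" case that the function's docstring cites as a false positive.

```python
import html

# A properly terminated entity is decoded as expected.
with_semicolon = html.unescape("caf&eacute;")

# html.unescape also accepts some entities with the semicolon missing.
no_semicolon = html.unescape("&amp")

# In informally written text, the same leniency misfires:
# "&not" is treated as the entity for the NOT SIGN character.
false_positive = html.unescape("this&not that")

print(with_semicolon)   # café
print(no_semicolon)     # &
print(false_positive)   # this¬ that
```

This is the behavior `unescape_html` avoids by decoding only escapes that end in a semicolon.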

(text: str)

Source from the content-addressed store, hash-verified

def unescape_html(text: str) -> str:
    """
    Decode HTML entities and character references, including some nonstandard
    ones written in all-caps.

    Python has a built-in called `html.unescape` that can decode HTML escapes,
    including a bunch of messy edge cases such as decoding escapes without
    semicolons such as "&amp".

    If you know you've got HTML-escaped text, applying `html.unescape` is the
    right way to convert it to plain text. But in ambiguous situations, that
    would create false positives. For example, the informally written text
    "this&not that" should not automatically be decoded as "this¬ that".

    In this function, we decode the escape sequences that appear in the
    `html.entities.html5` dictionary, as long as they are the unambiguous ones
    that end in semicolons.

    We also decode all-caps versions of Latin letters and common symbols.
    If a database contains the name 'P&EACUTE;REZ', we can read that and intuit
    that it was supposed to say 'PÉREZ'. This is limited to a smaller set of
    entities, because there are many instances where entity names are
    case-sensitive in complicated ways.

    >>> unescape_html('&lt;tag&gt;')
    '<tag>'

    >>> unescape_html('&Jscr;ohn &HilbertSpace;ancock')
    '𝒥ohn ℋancock'

    >>> unescape_html('&checkmark;')
    '✓'

    >>> unescape_html('P&eacute;rez')
    'Pérez'

    >>> unescape_html('P&EACUTE;REZ')
    'PÉREZ'

    >>> unescape_html('BUNDESSTRA&SZLIG;E')
    'BUNDESSTRASSE'

    >>> unescape_html('&ntilde; &Ntilde; &NTILDE; &nTILDE;')
    'ñ Ñ Ñ &nTILDE;'
    """
    return HTML_ENTITY_RE.sub(_unescape_fixup, text)


ANSI_RE = re.compile("\033\\[((?:\\d|;)*)([a-zA-Z])")
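The listing above relies on `HTML_ENTITY_RE` and `_unescape_fixup`, which are defined elsewhere and not shown here. The following is a minimal sketch of the approach the docstring describes, not ftfy's actual implementation: the names `build_entity_table`, `ENTITY_TABLE`, `ENTITY_RE`, and `unescape_sketch`, the regular expression, and the handling of numeric references are all assumptions made for illustration.

```python
import html
import re
from html.entities import html5


def build_entity_table() -> dict[str, str]:
    """Map '&name;' strings to characters, semicolon-terminated names only."""
    table: dict[str, str] = {}
    for name, char in html5.items():
        if not name.endswith(";"):
            continue  # skip the ambiguous semicolon-less forms such as "not"
        table["&" + name] = char
        # Assumption: accept an all-caps spelling only for all-lowercase
        # names, and only if html.unescape leaves that spelling untouched,
        # so it cannot collide with a differently-cased standard entity.
        if name == name.lower():
            upper = "&" + name.upper()
            if html.unescape(upper) == upper:
                table[upper] = char.upper()
    return table


ENTITY_TABLE = build_entity_table()

# Hypothetical pattern: an ampersand, an optional '#', letters/digits, and a
# mandatory semicolon. Text like "this&not that" never matches.
ENTITY_RE = re.compile(r"&#?[0-9A-Za-z]{1,32};")


def unescape_sketch(text: str) -> str:
    def fix(match: re.Match) -> str:
        entity = match.group(0)
        if entity in ENTITY_TABLE:
            return ENTITY_TABLE[entity]
        if entity.startswith("&#"):
            decoded = html.unescape(entity)
            # Keep the original if the reference was not fully consumed.
            return entity if ";" in decoded else decoded
        return entity  # unknown name: leave as written

    return ENTITY_RE.sub(fix, text)


print(unescape_sketch("P&EACUTE;REZ"))
print(unescape_sketch("this&not that"))
```

Building the table once up front keeps the per-match work to a dictionary lookup, and requiring the semicolon in the pattern is what prevents the "this&not that" false positive.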

Callers 1

test_entities (Function) · 0.90

Calls

no outgoing calls

Tested by 1

test_entities (Function) · 0.72