hub / github.com/rspeer/python-ftfy / unescape_html

Function unescape_html

ftfy/fixes.py:94–139

Decode HTML entities and character references, including some nonstandard ones written in all-caps. Python's built-in `html.unescape` also decodes HTML escapes, including messy edge cases such as escapes without semicolons (for example, "&amp"). This function is stricter: it decodes only the unambiguous escapes that end in semicolons.
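To make the contrast with the standard library concrete, here is a small demonstration that uses only `html.unescape` and does not call ftfy. It shows the semicolon-less decoding described above, and the false positive that the docstring warns about on informally written text. The variable names are illustrative only.

```python
import html

# html.unescape accepts some legacy escapes without a trailing semicolon.
lenient = html.unescape("&amp")

# On text that was never HTML-escaped, the same leniency misfires:
# "&not" is read as the legacy entity for the NOT SIGN character.
false_positive = html.unescape("this&not that")

print(lenient)         # &
print(false_positive)  # this¬ that
```

This is the behavior `unescape_html` avoids by requiring the semicolon.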

(text: str)

Source from the content-addressed store, hash-verified

def unescape_html(text: str) -> str:
    """
    Decode HTML entities and character references, including some nonstandard
    ones written in all-caps.

    Python has a built-in called `html.unescape` that can decode HTML escapes,
    including a bunch of messy edge cases such as decoding escapes without
    semicolons such as "&amp".

    If you know you've got HTML-escaped text, applying `html.unescape` is the
    right way to convert it to plain text. But in ambiguous situations, that
    would create false positives. For example, the informally written text
    "this&not that" should not automatically be decoded as "this¬ that".

    In this function, we decode the escape sequences that appear in the
    `html.entities.html5` dictionary, as long as they are the unambiguous ones
    that end in semicolons.

    We also decode all-caps versions of Latin letters and common symbols.
    If a database contains the name 'P&EACUTE;REZ', we can read that and intuit
    that it was supposed to say 'PÉREZ'. This is limited to a smaller set of
    entities, because there are many instances where entity names are
    case-sensitive in complicated ways.

        >>> unescape_html('&lt;tag&gt;')
        '<tag>'

        >>> unescape_html('&Jscr;ohn &HilbertSpace;ancock')
        '𝒥ohn ℋancock'

        >>> unescape_html('&checkmark;')
        '✓'

        >>> unescape_html('P&eacute;rez')
        'Pérez'

        >>> unescape_html('P&EACUTE;REZ')
        'PÉREZ'

        >>> unescape_html('BUNDESSTRA&SZLIG;E')
        'BUNDESSTRASSE'

        >>> unescape_html('&ntilde; &Ntilde; &NTILDE; &nTILDE;')
        'ñ Ñ Ñ &nTILDE;'
    """
    return HTML_ENTITY_RE.sub(_unescape_fixup, text)


ANSI_RE = re.compile("\033\\[((?:\\d|;)*)([a-zA-Z])")
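The listing shows only the final `HTML_ENTITY_RE.sub(_unescape_fixup, text)` call; the pattern and the fixup callback are defined elsewhere and do not appear on this page. As a rough illustration of the approach the docstring describes, the sketch below decodes only semicolon-terminated names from `html.entities.html5` and adds a simple all-caps fallback. `STRICT_ENTITY`, `decode_strict`, and the fallback rule are assumptions made for this example, not ftfy's actual definitions. In particular, the sketch tries the fallback on every all-caps name, whereas the docstring says ftfy limits it to a smaller set of entities.

```python
import re
from html.entities import html5

# Hypothetical pattern for this sketch: "&", an alphanumeric name, and a
# required ";". It is not ftfy's HTML_ENTITY_RE, which is not shown here.
STRICT_ENTITY = re.compile(r"&[0-9A-Za-z]{1,32};")


def decode_strict(text: str) -> str:
    """Decode named entities only when they end in a semicolon."""

    def fixup(match: re.Match) -> str:
        raw = match.group(0)
        name = raw[1:]  # drop "&" but keep ";", matching the html5 keys
        if name in html5:
            return html5[name]
        # Assumed fallback for all-caps names such as "EACUTE;": look up the
        # lowercase name and uppercase the decoded character.
        if name[:-1].isupper() and name.lower() in html5:
            return html5[name.lower()].upper()
        return raw  # unknown names are left untouched

    return STRICT_ENTITY.sub(fixup, text)


print(decode_strict("this&not that"))       # unchanged: no semicolon
print(decode_strict("P&eacute;rez"))        # Pérez
print(decode_strict("P&EACUTE;REZ"))        # PÉREZ
print(decode_strict("BUNDESSTRA&SZLIG;E"))  # BUNDESSTRASSE
print(decode_strict("&nTILDE;"))            # unchanged: mixed case
```

Requiring the semicolon in the pattern itself means informal text like "this&not that" never reaches the lookup, which is what keeps the false positives out.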

Callers 1

test_entities · Function · 0.90

Calls

no outgoing calls

Tested by 1

test_entities · Function · 0.72