hub / github.com/attardi/wikiextractor / unescape

Function unescape

wikiextractor/extract.py:715–737

Replaces HTML or XML character references and named entities in a text string with the characters they denote. References that cannot be resolved are left unchanged.
:param text: The HTML (or XML) source text.
:return: The plain text, as a Unicode string.

(text)

Source from the content-addressed store, hash-verified

# Imports the function relies on (not shown in the excerpt)
import re
from html.entities import name2codepoint  # maps entity name -> code point


def unescape(text):
    """
    Removes HTML or XML character references and entities from a text string.

    :param text: The HTML (or XML) source text.
    :return: The plain text, as a Unicode string, if necessary.
    """

    def fixup(m):
        text = m.group(0)  # whole match, e.g. "&#x41;" or "&amp;"
        code = m.group(1)  # the part between "&" (or "&#") and ";"
        try:
            if text[1] == "#":  # character reference
                if text[2] == "x":
                    # hexadecimal: drop the leading "x" before parsing
                    return chr(int(code[1:], 16))
                else:
                    return chr(int(code))
            else:  # named entity
                return chr(name2codepoint[code])
        except (ValueError, OverflowError, KeyError):
            return text  # leave as is

    # Raw string so that \w is not treated as a string escape
    return re.sub(r"&#?(\w+);", fixup, text)


# Match HTML comments
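The behavior of the listing above can be checked with a small standalone sketch. It repeats the function locally so that it runs without wikiextractor installed. The sample strings are illustrative and not taken from the project, and the narrowed `except` clause is an assumption standing in for a catch-all handler.

```python
import re
from html.entities import name2codepoint


def unescape(text):
    """Local copy of the function above, so this sketch runs standalone."""

    def fixup(m):
        text = m.group(0)
        code = m.group(1)
        try:
            if text[1] == "#":  # numeric character reference
                if text[2] == "x":
                    return chr(int(code[1:], 16))
                return chr(int(code))
            return chr(name2codepoint[code])  # named entity
        except (ValueError, OverflowError, KeyError):
            return text  # leave unresolved references as is

    return re.sub(r"&#?(\w+);", fixup, text)


# Illustrative inputs (hypothetical, not from the project's data)
samples = [
    "Tom &amp; Jerry",      # named entity
    "caf&#233;",            # decimal character reference
    "caf&#xe9;",            # hexadecimal character reference
    "&notanentity; stays",  # unknown name is left untouched
]
for s in samples:
    print(repr(s), "->", repr(unescape(s)))
```

Because the hexadecimal branch tests for a lowercase `x` only, a reference written as `&#X41;` falls through to the decimal parser, fails, and is returned unchanged.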

Callers 2

clean · Function · 0.85
define_template · Function · 0.85

Calls

no outgoing calls

Tested by

no test coverage detected