Function fulltext

newspaper/api.py:72–94

Takes an article HTML string as input and returns its full text. The input string is decoded via UnicodeDammit if needed.

(html, language='en')
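The "decoded if needed" behaviour in the description can be pictured with a small stdlib-only sketch. This is not UnicodeDammit and not the library's code: `ensure_text` and its `fallback_encodings` parameter are hypothetical names, and the fixed list of encodings is an assumption standing in for real encoding detection.

```python
def ensure_text(html, fallback_encodings=("utf-8", "cp1252")):
    """Return html as str, decoding only when bytes are passed in.

    Hypothetical helper: a simplified stand-in for encoding detection,
    trying a fixed list of encodings in order.
    """
    if isinstance(html, str):
        return html  # already text, nothing to decode
    for encoding in fallback_encodings:
        try:
            return html.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    # Last resort: decode as UTF-8 and replace undecodable bytes
    return html.decode("utf-8", errors="replace")


print(ensure_text("<p>already text</p>"))
print(ensure_text("<p>café</p>".encode("utf-8")))
print(ensure_text("<p>café</p>".encode("cp1252")))
```

The point of the sketch is the early return for `str` input: callers can pass either text or raw bytes, and decoding happens only in the second case.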


def fulltext(html, language='en'):
    """Takes article HTML string input and outputs the fulltext
    Input string is decoded via UnicodeDammit if needed
    """
    from .cleaners import DocumentCleaner
    from .configuration import Configuration
    from .extractors import ContentExtractor
    from .outputformatters import OutputFormatter

    config = Configuration()
    config.language = language

    extractor = ContentExtractor(config)
    document_cleaner = DocumentCleaner(config)
    output_formatter = OutputFormatter(config)

    doc = config.get_parser().fromstring(html)
    doc = document_cleaner.clean(doc)

    top_node = extractor.calculate_best_node(doc)
    top_node = extractor.post_cleanup(top_node)
    text, article_html = output_formatter.get_formatted(top_node)
    return text
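The function above runs four stages: parse, clean, pick the best node, format. The toy sketch below mirrors that shape using only the standard library. It is an illustration under loud assumptions, not the library's algorithm: `ToyExtractor`, `toy_fulltext`, the container tag list, and the "most paragraph text wins" scoring are all hypothetical; the real scoring lives in `ContentExtractor.calculate_best_node`, whose body is not shown on this page.

```python
from collections import defaultdict
from html.parser import HTMLParser

CONTAINERS = {"div", "article", "section", "td"}  # assumed block containers
SKIPPED = {"script", "style"}                     # content dropped while "cleaning"


class ToyExtractor(HTMLParser):
    """Group <p> text by the nearest enclosing container element."""

    def __init__(self):
        super().__init__()
        self.blocks = defaultdict(list)  # container id -> paragraph texts
        self._stack = [0]                # ids of open containers; 0 is the root
        self._next_id = 1
        self._buffer = None              # text chunks of the open <p>, or None
        self._skip_depth = 0             # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self._skip_depth += 1
        elif tag in CONTAINERS:
            self._stack.append(self._next_id)
            self._next_id += 1
        elif tag == "p":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in CONTAINERS:
            if len(self._stack) > 1:
                self._stack.pop()
        elif tag == "p" and self._buffer is not None:
            # Collapse runs of whitespace, then file the paragraph
            text = " ".join("".join(self._buffer).split())
            if text:
                self.blocks[self._stack[-1]].append(text)
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None and not self._skip_depth:
            self._buffer.append(data)


def toy_fulltext(html):
    """Return the paragraphs of the container holding the most paragraph text."""
    parser = ToyExtractor()
    parser.feed(html)
    parser.close()
    if not parser.blocks:
        return ""
    best = max(parser.blocks.values(), key=lambda ps: sum(len(p) for p in ps))
    return "\n\n".join(best)


SAMPLE = """
<html><body>
  <div id="nav"><p>Home</p><p>About</p></div>
  <div id="story">
    <p>The first paragraph of the story is long enough to matter.</p>
    <script>var tracking = "ignored";</script>
    <p>The second paragraph adds   more detail.</p>
  </div>
</body></html>
"""

print(toy_fulltext(SAMPLE))
```

Here the navigation block loses to the story block because it carries less paragraph text, which is the same idea as selecting a top node before formatting, in a much cruder form.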

test_parse_html · Method · 0.90

test_chinese_fulltext_extract · Method · 0.90

test_arabic_fulltext_extract · Method · 0.90

test_spanish_fulltext_extract · Method · 0.90

get_parser · Method · 0.95

clean · Method · 0.95

calculate_best_node · Method · 0.95

post_cleanup · Method · 0.95

get_formatted · Method · 0.95

Configuration · Class · 0.85

ContentExtractor · Class · 0.85

DocumentCleaner · Class · 0.85

OutputFormatter · Class · 0.85

fromstring · Method · 0.80

test_parse_html · Method · 0.72

test_chinese_fulltext_extract · Method · 0.72

test_arabic_fulltext_extract · Method · 0.72

test_spanish_fulltext_extract · Method · 0.72
