hub / github.com/codelucas/newspaper / memoize_articles

Function memoize_articles

newspaper/utils.py:258–305

When we parse the <a> links in an <html> page, on the 2nd run and later, check the <a> links of previous runs. If they match, it means the link must not be an article, because article urls change as time passes. This method also uniquifies articles.

(source, articles)
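The idea in the description can be illustrated without any files. The following is a minimal in-memory sketch, not the library's code: `filter_new_urls` and the sample paths are hypothetical names introduced here, and a plain set stands in for the per-domain memo file.

```python
def filter_new_urls(seen, current):
    """Return (fresh, updated_seen) for one crawl of a page.

    `seen` is the set of URLs recorded on earlier runs; `current` is the
    list of URLs found on this run. Both names are illustrative only.
    """
    fresh = []
    for url in current:
        url = url.strip()
        # Skip URLs recorded on a previous run and duplicates within this run.
        if url not in seen and url not in fresh:
            fresh.append(url)
    return fresh, seen | set(fresh)


seen = set()
# First run: nothing is memoized yet, so every unique URL is kept.
run1, seen = filter_new_urls(seen, ["/about", "/story-1", "/story-1"])
# Second run: "/about" appeared before, so it is assumed not to be an article.
run2, seen = filter_new_urls(seen, ["/about", "/story-2"])

print(run1)  # ['/about', '/story-1']
print(run2)  # ['/story-2']
```

Note that on the first run static links such as "/about" are still returned, because there is no earlier run to compare against.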

Source from the content-addressed store, hash-verified

def memoize_articles(source, articles):
    """When we parse the <a> links in an <html> page, on the 2nd run
    and later, check the <a> links of previous runs. If they match,
    it means the link must not be an article, because article urls
    change as time passes. This method also uniquifies articles.
    """
    source_domain = source.domain
    config = source.config

    if len(articles) == 0:
        return []

    memo = {}
    cur_articles = {article.url: article for article in articles}
    d_pth = os.path.join(settings.MEMO_DIR, domain_to_filename(source_domain))

    if os.path.exists(d_pth):
        f = codecs.open(d_pth, 'r', 'utf8')
        urls = f.readlines()
        f.close()
        urls = [u.strip() for u in urls]

        memo = {url: True for url in urls}
        # prev_length = len(memo)
        for url, article in list(cur_articles.items()):
            if memo.get(url):
                del cur_articles[url]

        valid_urls = list(memo.keys()) + list(cur_articles.keys())

        memo_text = '\r\n'.join(
            [href.strip() for href in (valid_urls)])
    # Our first run with memoization, save every url as valid
    else:
        memo_text = '\r\n'.join(
            [href.strip() for href in list(cur_articles.keys())])

    # new_length = len(cur_articles)
    if len(memo) > config.MAX_FILE_MEMO:
        # We still keep current batch of articles though!
        log.critical('memo overflow, dumping')
        memo_text = ''

    # TODO if source: source.write_upload_times(prev_length, new_length)
    ff = codecs.open(d_pth, 'w', 'utf-8')
    ff.write(memo_text)
    ff.close()
    return list(cur_articles.values())
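One consequence of the overflow branch is easy to miss: when the stored memo grows past the limit, the file is emptied, so the next run treats every link as new again. The sketch below is a simplified, self-contained approximation of that behavior that works on plain URL strings rather than article objects. `memoize_urls`, `memo_path`, and `max_memo` are hypothetical names; `max_memo` stands in for `config.MAX_FILE_MEMO` and its default here is arbitrary. It also uses `with` blocks and `"\n"` separators instead of the explicit `close()` calls and `'\r\n'` used in the source.

```python
import os
import tempfile


def memoize_urls(memo_path, urls, max_memo=1000):
    """Return the URLs in `urls` that are not recorded in `memo_path`.

    Hypothetical helper for illustration; not part of newspaper.
    """
    # Deduplicate the current batch while preserving order.
    current = list(dict.fromkeys(u.strip() for u in urls))
    if not current:
        return []

    memo = []
    if os.path.exists(memo_path):
        with open(memo_path, "r", encoding="utf-8") as f:
            memo = [line.strip() for line in f if line.strip()]

    seen = set(memo)
    fresh = [u for u in current if u not in seen]

    # Overflow rule: if the stored memo is too large, write an empty file.
    # The current batch is still returned to the caller.
    lines = [] if len(memo) > max_memo else memo + fresh

    with open(memo_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))
    return fresh


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.com.txt")
    first = memoize_urls(path, ["/a", "/b"], max_memo=2)
    second = memoize_urls(path, ["/a", "/b", "/c"], max_memo=2)
    # The memo now holds 3 entries, above the limit of 2, so it is dumped.
    third = memoize_urls(path, ["/a", "/d"], max_memo=2)
    # With the memo emptied, "/a" is reported as new again.
    fourth = memoize_urls(path, ["/a", "/d"], max_memo=2)

print(first, second, third, fourth)
```

Under these assumptions the fourth call returns "/a" even though it was seen on every earlier run, which is the trade-off of resetting the memo rather than trimming it.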

Callers

nothing calls this directly

Calls 2

domain_to_filename · Function · 0.85

join · Method · 0.80

Tested by

no test coverage detected
