hub / github.com/codelucas/newspaper / is_boostable

Method is_boostable

newspaper/extractors.py:729–754

A lot of times the first paragraph might be the caption under an image, so if we're going to boost a parent node we want to make sure it is connected to other paragraphs, at least for the first n paragraphs: the next sibling should be a paragraph and have at least some substantial weight to it.

(self, node)

Source from the content-addressed store, hash-verified

        return top_node

    def is_boostable(self, node):
        """Alot of times the first paragraph might be the caption under an image
        so we'll want to make sure if we're going to boost a parent node that
        it should be connected to other paragraphs, at least for the first n
        paragraphs so we'll want to make sure that the next sibling is a
        paragraph and has at least some substantial weight to it.
        """
        para = "p"
        steps_away = 0
        minimum_stopword_count = 5
        max_stepsaway_from_node = 3

        nodes = self.walk_siblings(node)
        for current_node in nodes:
            # <p>
            current_node_tag = self.parser.getTag(current_node)
            if current_node_tag == para:
                if steps_away >= max_stepsaway_from_node:
                    return False
                paraText = self.parser.getText(current_node)
                word_stats = self.stopwords_class(language=self.language).\
                    get_stopword_count(paraText)
                if word_stats.get_stopword_count() > minimum_stopword_count:
                    return True
                steps_away += 1
        return False

    def walk_siblings(self, node):
        current_sibling = self.parser.previousSibling(node)
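
The check above can be tried outside the library with a small standalone sketch. This is not the library's code: it swaps the parser for the standard-library `xml.etree.ElementTree`, replaces `stopwords_class` with a tiny hypothetical `STOPWORDS` set and a simplified tokenizer, and assumes that `walk_siblings` yields all preceding siblings, nearest first (the excerpt only shows that it starts from `previousSibling`). The helper names `count_stopwords` and `previous_siblings` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, tiny stopword list for illustration only; the real method
# obtains counts from self.stopwords_class for the configured language.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it",
             "that", "was", "for", "on", "with"}

MIN_STOPWORD_COUNT = 5  # same thresholds as in is_boostable above
MAX_STEPS_AWAY = 3


def count_stopwords(text):
    """Count whitespace-separated tokens of `text` found in STOPWORDS."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return sum(1 for w in words if w in STOPWORDS)


def previous_siblings(parent, node):
    """Return the siblings that precede `node`, nearest first (assumed order)."""
    children = list(parent)
    return children[:children.index(node)][::-1]


def is_boostable(parent, node):
    """Return True if a nearby preceding <p> carries enough stopwords."""
    steps_away = 0
    for sibling in previous_siblings(parent, node):
        if sibling.tag != "p":
            continue  # non-paragraph siblings are skipped and not counted
        if steps_away >= MAX_STEPS_AWAY:
            return False  # too many weak paragraphs already seen
        text = "".join(sibling.itertext())
        if count_stopwords(text) > MIN_STOPWORD_COUNT:
            return True
        steps_away += 1
    return False


article = ET.fromstring(
    "<div>"
    "<p>The mayor said that the city was ready for the storm"
    " and that it is safe to stay in.</p>"
    "<p>Photo: Reuters</p>"
    "<p>Officials agreed.</p>"
    "</div>"
)
print(is_boostable(article, article[2]))  # True
```

In this example the caption-like paragraph "Photo: Reuters" contributes no stopwords and only increments `steps_away`; the paragraph before it clears the threshold, so the last paragraph counts as boostable. A paragraph preceded only by non-`<p>` siblings, or by three weak paragraphs in a row, would not.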

Callers 1

calculate_best_node · Method · 0.95

Calls 4

walk_siblings · Method · 0.95
getTag · Method · 0.80
getText · Method · 0.80
get_stopword_count · Method · 0.45

Tested by

no test coverage detected