hub / github.com/MaartenGr/BERTopic / _sort_mappings_by_frequency

Method _sort_mappings_by_frequency

bertopic/_bertopic.py:4726–4765

Reorder mappings by their frequency. For example, if topic 88 was mapped to topic 5 and topic 5 turns out to be the largest topic, then topic 5 will be topic 0. The second largest will be topic 1, etc. If there are no mappings since no reduction of topics took place, then the topics will simply be ordered by their frequency and will get the topic ids based on that order.

(self, documents: pd.DataFrame)
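The renumbering described above can be sketched without BERTopic. This is a minimal illustration, assuming topic assignments arrive as a plain list of ints; the helper name `frequency_mapping` is hypothetical and not part of the library.

```python
from collections import Counter


def frequency_mapping(topics):
    """Map old topic ids to new ids ordered by descending frequency.

    Hypothetical helper for illustration only; -1 (outliers) always stays -1.
    """
    # Count every topic except the outlier class
    sizes = Counter(t for t in topics if t != -1)
    # most_common() sorts by count, descending; ties keep first-seen order
    ranked = [topic for topic, _ in sizes.most_common()]
    return {-1: -1, **{old: new for new, old in enumerate(ranked)}}


topics = [5, 5, 5, 2, 2, -1, -1, -1, -1, 7]
mapping = frequency_mapping(topics)
remapped = [mapping[t] for t in topics]
print(mapping)   # {-1: -1, 5: 0, 2: 1, 7: 2}
print(remapped)  # [0, 0, 0, 1, 1, -1, -1, -1, -1, 2]
```

Note that the outlier class is the most frequent label in this toy input, yet it keeps the id -1 because it is excluded before ranking.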

Source

4724	        return documents
4725
4726	    def _sort_mappings_by_frequency(self, documents: pd.DataFrame) -> pd.DataFrame:
4727	        """Reorder mappings by their frequency.
4728
4729	        For example, if topic 88 was mapped to topic
4730	        5 and topic 5 turns out to be the largest topic,
4731	        then topic 5 will be topic 0. The second largest
4732	        will be topic 1, etc.
4733
4734	        If there are no mappings since no reduction of topics
4735	        took place, then the topics will simply be ordered
4736	        by their frequency and will get the topic ids based
4737	        on that order.
4738
4739	        This means that -1 will remain the outlier class, and
4740	        that the rest of the topics will be in descending order
4741	        of ids and frequency.
4742
4743	        Arguments:
4744	            documents: Dataframe with documents and their corresponding IDs and Topics
4745
4746	        Returns:
4747	            documents: Updated dataframe with documents and the mapped
4748	                       and re-ordered topic ids
4749	        """
4750	        # No need to sort if it's the first pass of zero-shot topic modeling
4751	        nr_zeroshot = len(self._topic_id_to_zeroshot_topic_idx)
4752	        if self._is_zeroshot and not self.nr_topics and nr_zeroshot > 0:
4753	            return documents
4754
4755	        # Map topics based on frequency
4756	        self._update_topic_size(documents)
4757	        df = pd.DataFrame(self.topic_sizes_.items(), columns=["Old_Topic", "Size"]).sort_values("Size", ascending=False)
4758	        df = df[df.Old_Topic != -1]
4759	        sorted_topics = {**{-1: -1}, **dict(zip(df.Old_Topic, range(len(df))))}
4760	        self.topic_mapper_.add_mappings(sorted_topics, topic_model=self)
4761
4762	        # Map documents
4763	        documents.Topic = documents.Topic.map(sorted_topics).fillna(documents.Topic).astype(int)
4764	        self._update_topic_size(documents)
4765	        return documents
4766
4767	    def _map_probabilities(
4768	        self, probabilities: Union[np.ndarray, None], original_topics: bool = False
Callers 6

fit_transform · Method · 0.95

merge_topics · Method · 0.95

delete_topics · Method · 0.95

_reduce_to_n_topics · Method · 0.95

_auto_reduce_topics · Method · 0.95

test_topic_reduction_edge_cases · Function · 0.80

Calls 2

_update_topic_size · Method · 0.95

add_mappings · Method · 0.80

Tested by 1

test_topic_reduction_edge_cases · Function · 0.64