hub / github.com/MaartenGr/BERTopic / _guided_topic_modeling

Method _guided_topic_modeling

bertopic/_bertopic.py:4137–4175 · view source on GitHub ↗

Apply Guided Topic Modeling. We transform the seeded topics to embeddings using the same embedder as used for generating document embeddings. Then, we apply cosine similarity between the embeddings and set labels for documents that are more similar to one of

(self, embeddings: np.ndarray)

Source from the content-addressed store, hash-verified

4135	return documents, embeddings
4136
4137	def _guided_topic_modeling(self, embeddings: np.ndarray) -> Tuple[List[int], np.array]:
4138	"""Apply Guided Topic Modeling.
4139
4140	We transform the seeded topics to embeddings using the
4141	same embedder as used for generating document embeddings.
4142
4143	Then, we apply cosine similarity between the embeddings
4144	and set labels for documents that are more similar to
4145	one of the topics than the average document.
4146
4147	If a document is more similar to the average document
4148	than any of the topics, it gets the -1 label and is
4149	thereby not included in UMAP.
4150
4151	Arguments:
4152	embeddings: The document embeddings
4153
4154	Returns:
4155	y: The labels for each seeded topic
4156	embeddings: Updated embeddings
4157	"""
4158	logger.info("Guided - Find embeddings highly related to seeded topics.")
4159	# Create embeddings from the seeded topics
4160	seed_topic_list = [" ".join(seed_topic) for seed_topic in self.seed_topic_list]
4161	seed_topic_embeddings = self._extract_embeddings(seed_topic_list, verbose=self.verbose)
4162	seed_topic_embeddings = np.vstack([seed_topic_embeddings, embeddings.mean(axis=0)])
4163
4164	# Label documents that are most similar to one of the seeded topics
4165	sim_matrix = cosine_similarity(embeddings, seed_topic_embeddings)
4166	y = [np.argmax(sim_matrix[index]) for index in range(sim_matrix.shape[0])]
4167	y = [val if val != len(seed_topic_list) else -1 for val in y]
4168
4169	# Average the document embeddings related to the seeded topics with the
4170	# embedding of the seeded topic to force the documents in a cluster
4171	for seed_topic in range(len(seed_topic_list)):
4172	indices = [index for index, topic in enumerate(y) if topic == seed_topic]
4173	embeddings[indices] = embeddings[indices] * 0.75 + seed_topic_embeddings[seed_topic] * 0.25
4174	logger.info("Guided - Completed \u2713")
4175	return y, embeddings
4176
4177	def _extract_topics(
4178	self,

Callers 2

fit_transformMethod · 0.95

partial_fitMethod · 0.95

Calls 2

_extract_embeddingsMethod · 0.95

infoMethod · 0.80

Tested by

no test coverage detected