hub / github.com/MaartenGr/BERTopic / _extract_representative_docs

Method _extract_representative_docs

bertopic/_bertopic.py:4235–4313

Approximate the most representative documents per topic by sampling a subset of the documents in each topic and calculating which are most representative of their topic, based on the cosine similarity between c-TF-IDF representations.
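The ranking idea in this summary can be sketched in a few lines. This is a minimal illustration only: it uses small dense toy vectors in place of sparse c-TF-IDF rows, and `cosine` and `top_representative` are hypothetical helper names, not part of BERTopic.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_representative(
    doc_vectors: Sequence[Sequence[float]],
    topic_vector: Sequence[float],
    nr_repr_docs: int = 5,
) -> List[int]:
    """Return indices of the documents most similar to the topic vector."""
    # Never request more documents than are available
    nr_docs = min(nr_repr_docs, len(doc_vectors))
    scores = [cosine(vec, topic_vector) for vec in doc_vectors]
    # Rank document indices by descending similarity to the topic
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:nr_docs]


# Toy data (assumption): three documents over a three-term vocabulary
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
topic = [1.0, 0.0, 0.0]
best = top_representative(docs, topic, nr_repr_docs=2)
```

Here `best` holds the indices of the two documents closest to the topic vector. Asking for more documents than exist simply returns all of them in ranked order.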

(
        self,
        c_tf_idf: csr_matrix,
        documents: pd.DataFrame,
        topics: Mapping[str, List[Tuple[str, float]]],
        nr_samples: int = 500,
        nr_repr_docs: int = 5,
        diversity: float | None = None,
    )

Source from the content-addressed store, hash-verified

        self.representative_docs_ = repr_docs

    def _extract_representative_docs(
        self,
        c_tf_idf: csr_matrix,
        documents: pd.DataFrame,
        topics: Mapping[str, List[Tuple[str, float]]],
        nr_samples: int = 500,
        nr_repr_docs: int = 5,
        diversity: float | None = None,
    ) -> Union[List[str], List[List[int]]]:
        """Approximate most representative documents per topic by sampling
        a subset of the documents in each topic and calculating which are
        most representative to their topic based on the cosine similarity between
        c-TF-IDF representations.

        Arguments:
            c_tf_idf: The topic c-TF-IDF representation
            documents: All input documents
            topics: The candidate topics as calculated with c-TF-IDF
            nr_samples: The number of candidate documents to extract per topic
            nr_repr_docs: The number of representative documents to extract per topic
            diversity: The diversity between the most representative documents.
                       If None, no MMR is used. Otherwise, accepts values between 0 and 1.

        Returns:
            repr_docs_mappings: A dictionary from topic to representative documents
            representative_docs: A flat list of representative documents
            repr_doc_indices: Ordered indices of representative documents
                              that belong to each topic
            repr_doc_ids: The indices of representative documents
                          that belong to each topic
        """
        # Sample documents per topic
        documents_per_topic = (
            documents.drop("Image", axis=1, errors="ignore")
            .groupby("Topic")
            .sample(n=nr_samples, replace=True, random_state=42)
            .drop_duplicates()
        )

        # Find and extract documents that are most similar to the topic
        repr_docs = []
        repr_docs_indices = []
        repr_docs_mappings = {}
        repr_docs_ids = []
        labels = sorted(list(topics.keys()))
        for index, topic in enumerate(labels):
            # Slice data
            selection = documents_per_topic.loc[documents_per_topic.Topic == topic, :]
            selected_docs = selection["Document"].values
            selected_docs_ids = selection.index.tolist()

            # Calculate similarity
            nr_docs = nr_repr_docs if len(selected_docs) > nr_repr_docs else len(selected_docs)
            bow = self.vectorizer_model.transform(selected_docs)
            ctfidf = self.ctfidf_model.transform(bow)
            sim_matrix = cosine_similarity(ctfidf, c_tf_idf[index])

            # Use MMR to find representative but diverse documents
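The sampling step in the listing above draws `nr_samples` rows per topic with replacement and then drops duplicates. A plausible reading is that this caps each topic at `nr_samples` unique documents while still working for topics that hold fewer rows than that, since pandas raises an error when asked to sample more rows than a group contains without replacement. The sketch below reproduces the pattern on a toy DataFrame; the column names follow the listing, while the data and the `counts` variable are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in (assumption) for `documents`: topic 0 has five rows, topic 1 has two
documents = pd.DataFrame(
    {
        "Document": ["a", "b", "c", "d", "e", "f", "g"],
        "ID": list(range(7)),
        "Topic": [0, 0, 0, 0, 0, 1, 1],
    }
)

nr_samples = 3
sampled = (
    documents.groupby("Topic")
    # replace=True lets the draw succeed even when a topic has fewer than nr_samples rows
    .sample(n=nr_samples, replace=True, random_state=42)
    # Repeated draws of the same row collapse back into a single row
    .drop_duplicates()
)

# Unique rows kept per topic: at most nr_samples, and at most what the topic holds
counts = sampled.groupby("Topic").size()
```

Because the draw is with replacement, the number of unique rows left per topic can be smaller than `nr_samples` even for large topics, which is consistent with the docstring describing the result as an approximation.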

Callers 9

_save_representative_docs · Method · 0.95

extract_topics · Method · 0.80

Calls 2

mmr · Function · 0.90

transform · Method · 0.45
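The `mmr` callee listed above is not shown on this page. According to the docstring, MMR is applied only when `diversity` is not None. The following is a generic sketch of maximal marginal relevance under stated assumptions, not BERTopic's implementation: `mmr_select`, its parameters, and the toy scores are all hypothetical.

```python
from typing import List, Sequence


def mmr_select(
    relevance: Sequence[float],
    pairwise: Sequence[Sequence[float]],
    top_n: int,
    diversity: float = 0.3,
) -> List[int]:
    """Greedy maximal marginal relevance over precomputed similarities.

    relevance[i]   : similarity of candidate i to the topic
    pairwise[i][j] : similarity between candidates i and j
    diversity      : 0 ranks by relevance only; values near 1 favour dissimilar picks
    """
    if not relevance or top_n <= 0:
        return []

    # Start from the single most relevant candidate
    selected = [max(range(len(relevance)), key=lambda i: relevance[i])]
    candidates = [i for i in range(len(relevance)) if i != selected[0]]

    while candidates and len(selected) < top_n:
        def score(i: int) -> float:
            # Penalise candidates that resemble something already selected
            redundancy = max(pairwise[i][j] for j in selected)
            return (1 - diversity) * relevance[i] - diversity * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy scores (assumption): documents 0 and 1 are near-duplicates, document 2 is distinct
relevance = [0.9, 0.85, 0.5]
pairwise = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
plain = mmr_select(relevance, pairwise, top_n=2, diversity=0.0)
diverse = mmr_select(relevance, pairwise, top_n=2, diversity=0.7)
```

With `diversity=0.0` the selection is the two most relevant documents, including the near-duplicate. With `diversity=0.7` the near-duplicate is passed over in favour of the distinct document.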

Tested by

no test coverage detected