hub / github.com/MaartenGr/BERTopic / reduce_topics

Method reduce_topics

bertopic/_bertopic.py:2313–2378

Reduce the number of topics, either to a fixed number or automatically. If `nr_topics` is an integer, the topics are reduced to `nr_topics` using `AgglomerativeClustering` on the cosine distance matrix of the topic c-TF-IDF or semantic embeddings. If `nr_topics` is `"auto"`, HDBSCAN is run on the topic embeddings to reduce the number of topics automatically.

(
        self,
        docs: List[str],
        nr_topics: Union[int, str] = 20,
        images: List[str] | None = None,
        use_ctfidf: bool = False,
    )
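The agglomerative step in the description above can be illustrated with a small pure-Python sketch. This is not BERTopic's implementation and does not call scikit-learn's `AgglomerativeClustering`; it swaps in a hand-written greedy merge over cosine distances. The names `cosine_distance` and `merge_topics`, the toy vectors, and the choice of average linkage are all assumptions made for illustration.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def merge_topics(topic_vectors, nr_topics):
    """Greedily merge the closest clusters until nr_topics remain.

    Returns a list that maps each original topic index to a new label.
    """
    # Start with every topic in its own cluster.
    clusters = [[i] for i in range(len(topic_vectors))]

    def linkage(a, b):
        # Average pairwise cosine distance between two clusters (assumed linkage).
        dists = [
            cosine_distance(topic_vectors[i], topic_vectors[j])
            for i in a
            for j in b
        ]
        return sum(dists) / len(dists)

    while len(clusters) > nr_topics:
        # combinations() yields pairs with x < y, so deleting y keeps x valid.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    mapping = [0] * len(topic_vectors)
    for label, members in enumerate(clusters):
        for i in members:
            mapping[i] = label
    return mapping


# Hypothetical topic vectors: two near-duplicate pairs and one distinct topic.
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
    [0.0, 0.0, 1.0],
]
mapping = merge_topics(toy_vectors, nr_topics=3)
print(mapping)  # [0, 0, 1, 1, 2]
```

Topics 0 and 1 collapse into one cluster, as do topics 2 and 3, because each pair points in nearly the same direction. In the real method the vectors would be either c-TF-IDF rows or embedding-model vectors, depending on `use_ctfidf`.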

Source

def reduce_topics(
    self,
    docs: List[str],
    nr_topics: Union[int, str] = 20,
    images: List[str] | None = None,
    use_ctfidf: bool = False,
) -> None:
    """Reduce the number of topics to a fixed number of topics
    or automatically.

    If nr_topics is an integer, then the number of topics is reduced
    to nr_topics using `AgglomerativeClustering` on the cosine distance matrix
    of the topic c-TF-IDF or semantic embeddings.

    If nr_topics is `"auto"`, then HDBSCAN is used to automatically
    reduce the number of topics by running it on the topic embeddings.

    The topics, their sizes, and representations are updated.

    Arguments:
        docs: The docs you used when calling either `fit` or `fit_transform`
        nr_topics: The number of topics you want reduced to
        images: A list of paths to the images used when calling either
                `fit` or `fit_transform`
        use_ctfidf: Whether to calculate distances between topics based on c-TF-IDF embeddings. If False, the
                    embeddings from the embedding model are used.

    Updates:
        topics_ : Assigns topics to their merged representations.
        probabilities_ : Assigns probabilities to their merged representations.

    Examples:
    You can further reduce the topics by passing the documents with their
    topics and probabilities (if they were calculated):

    ```python
    topic_model.reduce_topics(docs, nr_topics=30)
    ```

    You can then access the updated topics and probabilities with:

    ```python
    topics = topic_model.topics_
    probabilities = topic_model.probabilities_
    ```
    """
    check_is_fitted(self)
    check_documents_type(docs)

    self.nr_topics = nr_topics
    documents = pd.DataFrame(
        {
            "Document": docs,
            "Topic": self.topics_,
            "Image": images,
            "ID": range(len(docs)),
        }
    )
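The docstring says that the topics and their sizes are updated after merging. As a rough illustration of what that means for per-document assignments, the sketch below re-labels documents through an old-to-new topic mapping and recounts sizes. It is not taken from the BERTopic source: `topics_before`, `topic_mapping`, and the data are hypothetical, and treating `-1` as the outlier label follows BERTopic's usual convention rather than anything shown in this listing.

```python
from collections import Counter

# Hypothetical per-document topic assignments before reduction (-1 = outlier).
topics_before = [0, 1, 2, 3, 1, -1, 2, 0]

# Hypothetical old-topic -> merged-topic mapping produced by a clustering step.
topic_mapping = {-1: -1, 0: 0, 1: 0, 2: 1, 3: 1}

# Re-label every document, then recount how many documents each topic holds.
topics_after = [topic_mapping[t] for t in topics_before]
topic_sizes = Counter(topics_after)

print(topics_after)       # [0, 0, 1, 1, 0, -1, 1, 0]
print(dict(topic_sizes))  # {0: 4, 1: 3, -1: 1}
```

Four original topics become two, the outlier document stays an outlier, and the sizes are simply the counts of the re-labelled assignments.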

Callers (3)

test_full_model · Function · 0.80

reduced_topic_model · Function · 0.80

test_topic_reduction · Function · 0.80

Calls (5)

_reduce_topics · Method · 0.95

_save_representative_docs · Method · 0.95

_map_probabilities · Method · 0.95

check_is_fitted · Function · 0.90

check_documents_type · Function · 0.90

Tested by (3)

test_full_model · Function · 0.64

reduced_topic_model · Function · 0.64

test_topic_reduction · Function · 0.64