hub / github.com/MaartenGr/BERTopic / reduce_topics

Method reduce_topics

bertopic/_bertopic.py:2313–2378

Reduce the number of topics either to a fixed number or automatically. If `nr_topics` is an integer, the topics are reduced to `nr_topics` using `AgglomerativeClustering` on the cosine distance matrix of the topic c-TF-IDF or semantic embeddings. If `nr_topics` is `"auto"`, HDBSCAN is run on the topic embeddings to reduce the number of topics automatically.
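
The integer case can be pictured with a small, dependency-free sketch. It is not BERTopic's code: the function names, the toy vectors, and the choice of average linkage are assumptions made for illustration, and the real method delegates this step to scikit-learn's `AgglomerativeClustering`.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def merge_topics(topic_vectors, nr_topics):
    """Hypothetical helper: repeatedly merge the two closest clusters
    (average linkage, an assumption) until nr_topics clusters remain.
    Returns a dict mapping each original topic id to a merged label."""
    ids = list(topic_vectors)
    # Precompute the pairwise cosine distance matrix between topics.
    dist = {
        frozenset((a, b)): cosine_distance(topic_vectors[a], topic_vectors[b])
        for a, b in combinations(ids, 2)
    }
    clusters = [[t] for t in ids]

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        pairs = [dist[frozenset((a, b))] for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    while len(clusters) > nr_topics:
        # combinations yields i < j, so deleting index j keeps i valid.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    return {t: label for label, members in enumerate(clusters) for t in members}


# Toy topic vectors (made up): topics 0 and 1 point in nearly the same
# direction, as do topics 2 and 3.
toy_vectors = {
    0: [1.0, 0.1, 0.0],
    1: [0.9, 0.2, 0.0],
    2: [0.0, 0.1, 1.0],
    3: [0.1, 0.0, 0.9],
}
mapping = merge_topics(toy_vectors, nr_topics=2)
```

With these toy vectors, topics 0 and 1 end up under one merged label and topics 2 and 3 under the other, which is the kind of old-to-new topic mapping a reduction step produces.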

(
        self,
        docs: List[str],
        nr_topics: Union[int, str] = 20,
        images: List[str] | None = None,
        use_ctfidf: bool = False,
    )
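
Because `nr_topics` is typed `Union[int, str]`, a caller has two documented modes to choose from. The sketch below only names the strategy each value implies according to the docstring; `choose_reduction_strategy` is a hypothetical helper, and the validation rules in it are assumptions rather than the library's own checks.

```python
from typing import Union


def choose_reduction_strategy(nr_topics: Union[int, str]) -> str:
    """Hypothetical helper: return the strategy the documented behaviour
    implies for a given nr_topics value."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(nr_topics, bool):
        raise TypeError("nr_topics must be an int or the string 'auto'")
    if isinstance(nr_topics, int):
        if nr_topics < 1:
            raise ValueError("nr_topics must be a positive integer")
        # Documented: AgglomerativeClustering on cosine distances.
        return "agglomerative"
    if nr_topics == "auto":
        # Documented: HDBSCAN run on the topic embeddings.
        return "hdbscan"
    raise ValueError(f"unsupported nr_topics value: {nr_topics!r}")


strategy_for_int = choose_reduction_strategy(20)
strategy_for_auto = choose_reduction_strategy("auto")
```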

Source from the content-addressed store, hash-verified

        self.ctfidf_model._idf_diag = self.ctfidf_model._idf_diag[mask]

    def reduce_topics(
        self,
        docs: List[str],
        nr_topics: Union[int, str] = 20,
        images: List[str] | None = None,
        use_ctfidf: bool = False,
    ) -> None:
        """Reduce the number of topics to a fixed number of topics
        or automatically.

        If nr_topics is an integer, then the number of topics is reduced
        to nr_topics using `AgglomerativeClustering` on the cosine distance matrix
        of the topic c-TF-IDF or semantic embeddings.

        If nr_topics is `"auto"`, then HDBSCAN is used to automatically
        reduce the number of topics by running it on the topic embeddings.

        The topics, their sizes, and representations are updated.

        Arguments:
            docs: The docs you used when calling either `fit` or `fit_transform`
            nr_topics: The number of topics you want reduced to
            images: A list of paths to the images used when calling either
                    `fit` or `fit_transform`
            use_ctfidf: Whether to calculate distances between topics based on c-TF-IDF embeddings. If False, the
                        embeddings from the embedding model are used.

        Updates:
            topics_ : Assigns topics to their merged representations.
            probabilities_ : Assigns probabilities to their merged representations.

        Examples:
        You can further reduce the topics by passing the documents with their
        topics and probabilities (if they were calculated):

        ```python
        topic_model.reduce_topics(docs, nr_topics=30)
        ```

        You can then access the updated topics and probabilities with:

        ```python
        topics = topic_model.topics_
        probabilities = topic_model.probabilities_
        ```
        """
        check_is_fitted(self)
        check_documents_type(docs)

        self.nr_topics = nr_topics
        documents = pd.DataFrame(
            {
                "Document": docs,
                "Topic": self.topics_,
                "Image": images,
                "ID": range(len(docs)),
            }
        )
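
The listing stops once the per-document table is built. The docstring states that topics and their sizes are updated afterwards; a rough sketch of that bookkeeping follows. It assumes a merged-label mapping is already available, and `apply_topic_mapping` and the sample data are hypothetical, not taken from the library.

```python
from collections import Counter


def apply_topic_mapping(doc_topics, mapping):
    """Hypothetical helper: relabel each document's topic through mapping
    and recount topic sizes. Topics absent from mapping (here the label -1,
    kept unchanged as an illustrative assumption) pass through as-is."""
    new_topics = [mapping.get(topic, topic) for topic in doc_topics]
    sizes = dict(Counter(new_topics))
    return new_topics, sizes


# One topic label per document, mirroring the "Topic" column above.
doc_topics = [0, 1, 1, 2, 3, -1]
# Assumed result of a reduction step: old topic id -> merged topic id.
mapping = {0: 0, 1: 0, 2: 1, 3: 1}

new_topics, sizes = apply_topic_mapping(doc_topics, mapping)
```

Keeping one label per document, in the same order as `docs`, is what lets a reduction step rewrite assignments in place without losing track of which document belongs where.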

Callers 3

test_full_model · Function · 0.80
reduced_topic_model · Function · 0.80
test_topic_reduction · Function · 0.80

Calls 5

_reduce_topics · Method · 0.95
_map_probabilities · Method · 0.95
check_is_fitted · Function · 0.90
check_documents_type · Function · 0.90

Tested by 3

test_full_model · Function · 0.64
reduced_topic_model · Function · 0.64
test_topic_reduction · Function · 0.64