Reduce the number of topics, either to a fixed number or automatically. If `nr_topics` is an integer, the topics are reduced to `nr_topics` using `AgglomerativeClustering` on the cosine distance matrix of the topic c-TF-IDF or semantic embeddings.
reduce_topics(
    self,
    docs: List[str],
    nr_topics: Union[int, str] = 20,
    images: List[str] | None = None,
    use_ctfidf: bool = False,
) -> None
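The fixed-number case described above can be sketched outside the library as follows. This is a minimal illustration, not BERTopic's own implementation: `merge_topics` is a hypothetical helper, the toy `topic_embeddings` array is made up, and `linkage="average"` is an assumption chosen because Ward linkage does not accept a precomputed distance matrix.

```python
import inspect

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity


def merge_topics(topic_embeddings: np.ndarray, nr_topics: int) -> np.ndarray:
    """Return a merged cluster label for each topic (hypothetical helper)."""
    # Cosine distance is 1 - cosine similarity; clip tiny negatives from rounding.
    distances = np.clip(1.0 - cosine_similarity(topic_embeddings), 0.0, None)
    np.fill_diagonal(distances, 0.0)

    # scikit-learn >= 1.2 names this keyword `metric`; older releases use `affinity`.
    params = inspect.signature(AgglomerativeClustering.__init__).parameters
    keyword = "metric" if "metric" in params else "affinity"
    model = AgglomerativeClustering(
        n_clusters=nr_topics, linkage="average", **{keyword: "precomputed"}
    )
    return model.fit_predict(distances)


# Toy data (assumed): six topic vectors pointing in three rough directions.
topic_embeddings = np.array(
    [
        [1.0, 0.0, 0.0],
        [0.9, 0.1, 0.0],
        [0.0, 1.0, 0.0],
        [0.1, 0.9, 0.0],
        [0.0, 0.0, 1.0],
        [0.0, 0.1, 0.9],
    ]
)
new_labels = merge_topics(topic_embeddings, nr_topics=3)

# Map each original topic id to the id of the merged topic it now belongs to.
mapping = {old: int(new) for old, new in enumerate(new_labels)}
```

Topics that share a label in `mapping` would be merged into one topic; the actual label integers depend on the clustering and carry no meaning of their own.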
        self.ctfidf_model._idf_diag = self.ctfidf_model._idf_diag[mask]

    def reduce_topics(
        self,
        docs: List[str],
        nr_topics: Union[int, str] = 20,
        images: List[str] | None = None,
        use_ctfidf: bool = False,
    ) -> None:
        """Reduce the number of topics to a fixed number of topics
        or automatically.

        If nr_topics is an integer, then the number of topics is reduced
        to nr_topics using `AgglomerativeClustering` on the cosine distance matrix
        of the topic c-TF-IDF or semantic embeddings.

        If nr_topics is `"auto"`, then HDBSCAN is used to automatically
        reduce the number of topics by running it on the topic embeddings.

        The topics, their sizes, and representations are updated.

        Arguments:
            docs: The docs you used when calling either `fit` or `fit_transform`
            nr_topics: The number of topics you want reduced to
            images: A list of paths to the images used when calling either
                    `fit` or `fit_transform`
            use_ctfidf: Whether to calculate distances between topics based on c-TF-IDF embeddings. If False, the
                        embeddings from the embedding model are used.

        Updates:
            topics_ : Assigns topics to their merged representations.
            probabilities_ : Assigns probabilities to their merged representations.

        Examples:
        You can further reduce the topics by passing the documents with their
        topics and probabilities (if they were calculated):

        ```python
        topic_model.reduce_topics(docs, nr_topics=30)
        ```

        You can then access the updated topics and probabilities with:

        ```python
        topics = topic_model.topics_
        probabilities = topic_model.probabilities_
        ```
        """
        check_is_fitted(self)
        check_documents_type(docs)

        self.nr_topics = nr_topics
        documents = pd.DataFrame(
            {
                "Document": docs,
                "Topic": self.topics_,
                "Image": images,
                "ID": range(len(docs)),
            }
        )