| 86 | |
| 87 | |
| 88 | class BERTopic: |
| 89 | """BERTopic is a topic modeling technique that leverages BERT embeddings and |
| 90 | c-TF-IDF to create dense clusters allowing for easily interpretable topics |
| 91 | whilst keeping important words in the topic descriptions. |
| 92 | |
| 93 | The default embedding model is `all-MiniLM-L6-v2` when selecting `language="english"` |
| 94 | and `paraphrase-multilingual-MiniLM-L12-v2` when selecting `language="multilingual"`. |
| 95 | |
| 96 | Attributes: |
| 97 | topics_ (List[int]) : The topics that are generated for each document after training or updating |
| 98 | the topic model. The most recent topics are tracked. |
| 99 | probabilities_ (List[float]): The probability of the assigned topic per document. These are |
| 100 | only calculated if an HDBSCAN model is used for the clustering step. |
| 101 | When `calculate_probabilities=True`, this instead holds the probabilities |
| 102 | of all topics per document. |
| 103 | topic_sizes_ (Mapping[int, int]) : The size of each topic. |
| 104 | topic_mapper_ (TopicMapper) : A class for tracking topics and their mappings anytime they are |
| 105 | merged, reduced, added, or removed. |
| 106 | topic_representations_ (Mapping[int, List[Tuple[str, float]]]) : The top n terms per topic and their respective |
| 107 | c-TF-IDF values. |
| 108 | c_tf_idf_ (csr_matrix) : The topic-term matrix as calculated through c-TF-IDF. To access its respective |
| 109 | words, run `.vectorizer_model.get_feature_names_out()`, or |
| 110 | `.vectorizer_model.get_feature_names()` on scikit-learn versions older than 1.0. |
| 111 | topic_labels_ (Mapping[int, str]) : The default labels for each topic. |
| 112 | custom_labels_ (List[str]) : Custom labels for each topic. |
| 113 | topic_embeddings_ (np.ndarray) : The embeddings for each topic. They are calculated by taking the |
| 114 | centroid embedding of each cluster. |
| 115 | representative_docs_ (Mapping[int, List[str]]) : The representative documents for each topic. |
| 116 | |
| 117 | Examples: |
| 118 | ```python |
| 119 | from bertopic import BERTopic |
| 120 | from sklearn.datasets import fetch_20newsgroups |
| 121 | |
| 122 | docs = fetch_20newsgroups(subset='all')['data'] |
| 123 | topic_model = BERTopic() |
| 124 | topics, probabilities = topic_model.fit_transform(docs) |
| 125 | ``` |
| 126 | |
| 127 | If you want to use your own embedding model, use it as follows: |
| 128 | |
| 129 | ```python |
| 130 | from bertopic import BERTopic |
| 131 | from sklearn.datasets import fetch_20newsgroups |
| 132 | from sentence_transformers import SentenceTransformer |
| 133 | |
| 134 | docs = fetch_20newsgroups(subset='all')['data'] |
| 135 | sentence_model = SentenceTransformer("all-MiniLM-L6-v2") |
| 136 | topic_model = BERTopic(embedding_model=sentence_model) |
| 137 | ``` |
| 138 | |
| 139 | Due to the stochastic nature of UMAP, the results from BERTopic might differ |
| 140 | between runs and their quality can vary. Precomputing your own embeddings |
| 141 | lets you rerun BERTopic several times, without re-embedding the documents, |
| 142 | until you find the topics that suit you best. |
| 143 | """ |
| 144 | |
| 145 | def __init__( |
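
The docstring above names c-TF-IDF as the scoring behind `topic_representations_` and `c_tf_idf_` without spelling it out. The following is a minimal pure-Python sketch of the general idea, assuming the commonly documented form: term frequency within a topic, multiplied by `log(1 + A / f)`, where `A` is the average number of words per topic and `f` is the term's frequency across all topics. The helpers `c_tf_idf` and `top_terms` are hypothetical names for illustration, whitespace tokenization is a simplification, and this is not the library's own implementation.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Score terms per topic with a class-based TF-IDF (illustrative sketch).

    docs_per_topic maps a topic id to the list of documents assigned to it.
    """
    # Treat all documents of a topic as one large "class document".
    term_counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }

    # Frequency of each term across all topics.
    total = Counter()
    for counts in term_counts.values():
        total.update(counts)

    # Average number of words per topic.
    avg_words = sum(total.values()) / len(term_counts)

    scores = {}
    for topic, counts in term_counts.items():
        n_words = sum(counts.values())
        scores[topic] = {
            term: (count / n_words) * math.log(1 + avg_words / total[term])
            for term, count in counts.items()
        }
    return scores


def top_terms(scores, topic, n=3):
    """Return the n highest-scoring (term, score) pairs for a topic."""
    ranked = sorted(scores[topic].items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]


docs_per_topic = {
    0: ["cats purr", "cats meow"],
    1: ["dogs bark", "dogs howl"],
}
scores = c_tf_idf(docs_per_topic)
print(top_terms(scores, 0))
print(top_terms(scores, 1))
```

The list of `(term, score)` pairs returned by `top_terms` has the same shape as one entry of `topic_representations_`: terms that are frequent inside a topic and rare elsewhere rank highest.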
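
The docstring describes `topic_embeddings_` as the centroid embedding of each cluster. As a rough illustration of what a centroid means here, the sketch below averages document vectors per assigned topic using plain Python lists. The function name `topic_centroids` and the toy vectors are assumptions made for this example; how the library treats outlier topics or any weighting is not covered.

```python
from collections import defaultdict


def topic_centroids(embeddings, topics):
    """Average the document embeddings assigned to each topic."""
    grouped = defaultdict(list)
    for vector, topic in zip(embeddings, topics):
        grouped[topic].append(vector)

    # zip(*vectors) iterates over dimensions, so each mean is per dimension.
    return {
        topic: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for topic, vectors in grouped.items()
    }


# Toy 2-dimensional embeddings for three documents in two topics.
embeddings = [[0.0, 1.0], [2.0, 3.0], [4.0, 4.0]]
topics = [0, 0, 1]
centroids = topic_centroids(embeddings, topics)
print(centroids)
```

With real sentence embeddings the same averaging would normally be done with NumPy on an `np.ndarray`; lists are used here only to keep the sketch dependency-free.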