hub / github.com/MaartenGr/BERTopic / BERTopic

Class BERTopic

bertopic/_bertopic.py:88–4884

BERTopic is a topic modeling technique that leverages BERT embeddings and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions. The default embedding model is `all-MiniLM-L6-v2` when selecting `language="english"` and `paraphrase-multilingual-MiniLM-L12-v2` when selecting `language="multilingual"`.
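The summary mentions c-TF-IDF without showing what it computes. The block below is a minimal, simplified sketch and not the library's implementation. It assumes the commonly cited formulation, in which a term's weight in a topic is its normalized frequency in that topic times `log(1 + A / f_x)`, where `A` is the average number of words per topic and `f_x` is the term's frequency across all topics. The function name `c_tf_idf`, the whitespace tokenization, and the toy documents are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def c_tf_idf(docs_per_topic: Dict[int, List[str]]) -> Dict[int, Dict[str, float]]:
    """Score terms per topic with a simplified class-based TF-IDF."""
    if not docs_per_topic:
        raise ValueError("at least one topic is required")

    # Treat all documents of a topic as one bag of words.
    # Assumption: lowercase whitespace tokens stand in for a real vectorizer.
    counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }

    # f_x: frequency of each term across all topics.
    total = Counter()
    for counter in counts.values():
        total.update(counter)

    # A: average number of words per topic.
    avg_words = sum(total.values()) / len(counts)

    scores: Dict[int, Dict[str, float]] = {}
    for topic, counter in counts.items():
        n_words = sum(counter.values())
        scores[topic] = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in counter.items()
        }
    return scores


# Toy input: documents already grouped by their assigned topic.
toy_docs = {
    0: ["the cat sat", "the cat purred"],
    1: ["the stock fell", "the stock market"],
}
toy_scores = c_tf_idf(toy_docs)
top_term_topic_0 = max(toy_scores[0], key=toy_scores[0].get)
top_term_topic_1 = max(toy_scores[1], key=toy_scores[1].get)
print(top_term_topic_0, top_term_topic_1)
```

In this toy input, "the" appears in both topics, so it is weighted below the topic-specific terms "cat" and "stock", which is the behaviour that keeps important words in the topic descriptions.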

Source from the content-addressed store, hash-verified

class BERTopic:
    """BERTopic is a topic modeling technique that leverages BERT embeddings and
    c-TF-IDF to create dense clusters allowing for easily interpretable topics
    whilst keeping important words in the topic descriptions.

    The default embedding model is `all-MiniLM-L6-v2` when selecting `language="english"`
    and `paraphrase-multilingual-MiniLM-L12-v2` when selecting `language="multilingual"`.

    Attributes:
        topics_ (List[int]) : The topics that are generated for each document after training or updating
                              the topic model. The most recent topics are tracked.
        probabilities_ (List[float]): The probability of the assigned topic per document. These are
                                      only calculated if a HDBSCAN model is used for the clustering step.
                                      When `calculate_probabilities=True`, then it is the probabilities
                                      of all topics per document.
        topic_sizes_ (Mapping[int, int]) : The size of each topic.
        topic_mapper_ (TopicMapper) : A class for tracking topics and their mappings anytime they are
                                      merged, reduced, added, or removed.
        topic_representations_ (Mapping[int, Tuple[int, float]]) : The top n terms per topic and their respective
                                                                   c-TF-IDF values.
        c_tf_idf_ (csr_matrix) : The topic-term matrix as calculated through c-TF-IDF. To access its respective
                                 words, run `.vectorizer_model.get_feature_names()` or
                                 `.vectorizer_model.get_feature_names_out()`
        topic_labels_ (Mapping[int, str]) : The default labels for each topic.
        custom_labels_ (List[str]) : Custom labels for each topic.
        topic_embeddings_ (np.ndarray) : The embeddings for each topic. They are calculated by taking the
                                         centroid embedding of each cluster.
        representative_docs_ (Mapping[int, str]) : The representative documents for each topic.

    Examples:
    ```python
    from bertopic import BERTopic
    from sklearn.datasets import fetch_20newsgroups

    docs = fetch_20newsgroups(subset='all')['data']
    topic_model = BERTopic()
    topics, probabilities = topic_model.fit_transform(docs)
    ```

    If you want to use your own embedding model, use it as follows:

    ```python
    from bertopic import BERTopic
    from sklearn.datasets import fetch_20newsgroups
    from sentence_transformers import SentenceTransformer

    docs = fetch_20newsgroups(subset='all')['data']
    sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
    topic_model = BERTopic(embedding_model=sentence_model)
    ```

    Due to the stochastic nature of UMAP, the results from BERTopic might differ
    and the quality can degrade. Using your own embeddings allows you to
    try out BERTopic several times until you find the topics that suit
    you best.
    """

    def __init__(
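
The docstring describes `topic_embeddings_` as the centroid embedding of each cluster. As a rough illustration of that idea only, the sketch below averages document vectors per assigned topic in plain Python. It is not the library's code: the helper name `centroid_topic_embeddings` and the two-dimensional toy vectors are hypothetical, and real embeddings would normally be NumPy arrays with several hundred dimensions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def centroid_topic_embeddings(
    embeddings: Sequence[Sequence[float]],
    topics: Sequence[int],
) -> Dict[int, List[float]]:
    """Average the document embeddings assigned to each topic."""
    if len(embeddings) != len(topics):
        raise ValueError("embeddings and topics must have the same length")

    # Group document vectors by their assigned topic id.
    grouped = defaultdict(list)
    for vector, topic in zip(embeddings, topics):
        grouped[topic].append(vector)

    # The centroid is the per-dimension mean of a topic's vectors.
    centroids: Dict[int, List[float]] = {}
    for topic, vectors in grouped.items():
        dim = len(vectors[0])
        centroids[topic] = [
            sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)
        ]
    return centroids


# Toy data: three documents, two topics (values are made up).
doc_embeddings = [[0.0, 1.0], [2.0, 3.0], [10.0, 10.0]]
doc_topics = [0, 0, 1]
centroids = centroid_topic_embeddings(doc_embeddings, doc_topics)
print(centroids)  # {0: [1.0, 2.0], 1: [10.0, 10.0]}
```

A topic with a single document, such as topic 1 here, simply gets that document's embedding as its centroid.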

Callers 15

test_load_save_model · Function · 0.90
test_get_params · Function · 0.90
test_no_plotly · Function · 0.90
base_topic_model · Function · 0.90
zeroshot_topic_model · Function · 0.90
custom_topic_model · Function · 0.90
kmeans_pca_topic_model · Function · 0.90
supervised_topic_model · Function · 0.90
online_topic_model · Function · 0.90
cuml_base_topic_model · Function · 0.90

Calls

no outgoing calls

Tested by 15

test_load_save_model · Function · 0.72
test_get_params · Function · 0.72
test_no_plotly · Function · 0.72
base_topic_model · Function · 0.72
zeroshot_topic_model · Function · 0.72
custom_topic_model · Function · 0.72
kmeans_pca_topic_model · Function · 0.72
supervised_topic_model · Function · 0.72
online_topic_model · Function · 0.72
cuml_base_topic_model · Function · 0.72