Class BERTopic

bertopic/_bertopic.py:88–4884

BERTopic is a topic modeling technique that leverages BERT embeddings and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.
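To make the c-TF-IDF idea in the summary concrete, here is a minimal, dependency-free sketch. It assumes the class-based weighting `tf(t, c) * log(1 + A / f(t))`, where `tf(t, c)` is the normalized frequency of term `t` in topic `c`, `f(t)` is the frequency of `t` across all topics, and `A` is the average number of words per topic. The helpers `c_tf_idf` and `top_terms` and the whitespace tokenizer are hypothetical and exist only for illustration. The library itself works on a vectorizer-backed sparse matrix, so treat this as an approximation of the idea and not the library's implementation.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Score terms per topic with a simplified class-based TF-IDF.

    docs_per_topic maps a topic id to the list of documents assigned to it.
    (Hypothetical helper; whitespace tokenization is an assumption.)
    """
    # Treat every topic as one long document and count its terms.
    term_counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }

    # f(t): frequency of each term across all topics.
    total_freq = Counter()
    for counts in term_counts.values():
        total_freq.update(counts)

    # A: average number of words per topic.
    avg_words = sum(total_freq.values()) / len(term_counts)

    scores = {}
    for topic, counts in term_counts.items():
        n_words = sum(counts.values())
        scores[topic] = {
            term: (count / n_words) * math.log(1 + avg_words / total_freq[term])
            for term, count in counts.items()
        }
    return scores


def top_terms(scores, n=3):
    """Return the n highest-scoring (term, score) pairs per topic."""
    return {
        topic: sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:n]
        for topic, weights in scores.items()
    }


docs_per_topic = {
    0: ["cat chased mouse", "cat sat on mat"],
    1: ["stock market fell", "investors sold stock"],
}
scores = c_tf_idf(docs_per_topic)
representations = top_terms(scores, n=2)
```

The resulting `representations` dictionary maps each topic id to its highest-scoring `(term, score)` pairs, which is roughly the shape the class docstring describes for `topic_representations_`.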

Source from the content-addressed store, hash-verified

class BERTopic:
    """BERTopic is a topic modeling technique that leverages BERT embeddings and
    c-TF-IDF to create dense clusters allowing for easily interpretable topics
    whilst keeping important words in the topic descriptions.

    The default embedding model is `all-MiniLM-L6-v2` when selecting `language="english"`
    and `paraphrase-multilingual-MiniLM-L12-v2` when selecting `language="multilingual"`.

    Attributes:
        topics_ (List[int]) : The topics that are generated for each document after training or updating
            the topic model. The most recent topics are tracked.
        probabilities_ (List[float]): The probability of the assigned topic per document. These are
            only calculated if a HDBSCAN model is used for the clustering step.
            When `calculate_probabilities=True`, then it is the probabilities
            of all topics per document.
        topic_sizes_ (Mapping[int, int]) : The size of each topic.
        topic_mapper_ (TopicMapper) : A class for tracking topics and their mappings anytime they are
            merged, reduced, added, or removed.
        topic_representations_ (Mapping[int, Tuple[int, float]]) : The top n terms per topic and their respective
            c-TF-IDF values.
        c_tf_idf_ (csr_matrix) : The topic-term matrix as calculated through c-TF-IDF. To access its respective
            words, run `.vectorizer_model.get_feature_names()` or
            `.vectorizer_model.get_feature_names_out()`
        topic_labels_ (Mapping[int, str]) : The default labels for each topic.
        custom_labels_ (List[str]) : Custom labels for each topic.
        topic_embeddings_ (np.ndarray) : The embeddings for each topic. They are calculated by taking the
            centroid embedding of each cluster.
        representative_docs_ (Mapping[int, str]) : The representative documents for each topic.

    Examples:
    ```python
    from bertopic import BERTopic
    from sklearn.datasets import fetch_20newsgroups

    docs = fetch_20newsgroups(subset='all')['data']
    topic_model = BERTopic()
    topics, probabilities = topic_model.fit_transform(docs)
    ```

    If you want to use your own embedding model, use it as follows:

    ```python
    from bertopic import BERTopic
    from sklearn.datasets import fetch_20newsgroups
    from sentence_transformers import SentenceTransformer

    docs = fetch_20newsgroups(subset='all')['data']
    sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
    topic_model = BERTopic(embedding_model=sentence_model)
    ```

    Due to the stochastic nature of UMAP, the results from BERTopic might differ
    and the quality can degrade. Using your own embeddings allows you to
    try out BERTopic several times until you find the topics that suit
    you best.
    """

    def __init__(
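The closing note in the docstring suggests reusing your own embeddings so that BERTopic can be fitted several times without re-encoding the documents. A minimal sketch of that workflow follows. The function name `fit_with_precomputed_embeddings` and its `n_runs` parameter are hypothetical and not part of the library. The sketch assumes `bertopic` and `sentence-transformers` are installed and imports them lazily inside the function so the definition itself loads without them.

```python
def fit_with_precomputed_embeddings(docs, n_runs=3, model_name="all-MiniLM-L6-v2"):
    """Embed the documents once, then fit BERTopic n_runs times on the same embeddings.

    Hypothetical helper for illustration; returns a list of (topic_model, topics) pairs.
    """
    # Lazy imports: the heavy dependencies are only needed when the function is called.
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer

    # Encode once; this is usually the most expensive step.
    sentence_model = SentenceTransformer(model_name)
    embeddings = sentence_model.encode(docs, show_progress_bar=False)

    runs = []
    for _ in range(n_runs):
        # A fresh model per run: UMAP is stochastic, so the topics can differ between runs.
        topic_model = BERTopic(embedding_model=sentence_model)
        topics, _probabilities = topic_model.fit_transform(docs, embeddings)
        runs.append((topic_model, topics))
    return runs
```

Calling it with the `docs` list from the docstring examples would return one fitted model per run, which can then be compared, for instance through `get_topic_info()`, before choosing the one to keep.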

Callers 15

test_load_save_model · Function · 0.90
test_get_params · Function · 0.90
test_no_plotly · Function · 0.90
base_topic_model · Function · 0.90
zeroshot_topic_model · Function · 0.90
custom_topic_model · Function · 0.90
representation_topic_model · Function · 0.90
kmeans_pca_topic_model · Function · 0.90
supervised_topic_model · Function · 0.90
online_topic_model · Function · 0.90
cuml_base_topic_model · Function · 0.90
test_extract_incorrect_embeddings · Function · 0.90

Calls

no outgoing calls

Tested by 15

test_load_save_model · Function · 0.72
test_get_params · Function · 0.72
test_no_plotly · Function · 0.72
base_topic_model · Function · 0.72
zeroshot_topic_model · Function · 0.72
custom_topic_model · Function · 0.72
representation_topic_model · Function · 0.72
kmeans_pca_topic_model · Function · 0.72
supervised_topic_model · Function · 0.72
online_topic_model · Function · 0.72
cuml_base_topic_model · Function · 0.72
test_extract_incorrect_embeddings · Function · 0.72