github.com/MaartenGr/BERTopic @v0.17.4 sqlite

339 symbols · 1,625 edges · 87 files · 201 documented · 59%

README

BERTopic

BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

BERTopic supports all kinds of topic modeling techniques:

Guided	Supervised	Semi-supervised
Manual	Multi-topic distributions	Hierarchical
Class-based	Dynamic	Online/Incremental
Multimodal	Multi-aspect	Text Generation/LLM
Zero-shot (new!)	Merge Models (new!)	Seed Words (new!)

Corresponding Medium posts can be found here, here and here. For a more detailed overview, you can read the paper or see a brief overview.

Installation

Installation, with sentence-transformers, can be done using uv:

uv add bertopic

or with pip:

pip install bertopic

If you want to install BERTopic with other embedding models, you can choose one of the following:

# Choose an embedding backend
pip install bertopic[flair,gensim,spacy,use]

# Topic modeling with images
pip install bertopic[vision]

For a light-weight installation without transformers, UMAP and/or HDBSCAN (for training with Model2Vec or inference), see this tutorial.

Getting Started

For an in-depth overview of the features of BERTopic you can check the full documentation or you can follow along with one of the examples below:

Name	Link
Start Here - Best Practices in BERTopic
🆕 New! - Topic Modeling on Large Data (GPU Acceleration)
🆕 New! - Topic Modeling with Llama 2 🦙
🆕 New! - Topic Modeling with Quantized LLMs
Topic Modeling with BERTopic
(Custom) Embedding Models in BERTopic
Advanced Customization in BERTopic
(semi-)Supervised Topic Modeling with BERTopic
Dynamic Topic Modeling with Trump's Tweets
Topic Modeling arXiv Abstracts

Quick Start

We start by extracting topics from the well-known 20 newsgroups dataset containing English documents:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

After generating topics and their probabilities, we can access all of the topics together with their topic representations:

>>> topic_model.get_topic_info()

Topic   Count   Name
-1  4630    -1_can_your_will_any
0   693 49_windows_drive_dos_file
1   466 32_jesus_bible_christian_faith
2   441 2_space_launch_orbit_lunar
3   381 22_key_encryption_keys_encrypted
...

The -1 topic refers to all outlier documents and is typically ignored. Each word in a topic describes the underlying theme of that topic and can be used to interpret it. Next, let's take a look at the most frequent topic that was generated:

>>> topic_model.get_topic(0)

[('windows', 0.006152228076250982),
 ('drive', 0.004982897610645755),
 ('dos', 0.004845038866360651),
 ('file', 0.004140142872194834),
 ('disk', 0.004131678774810884),
 ('mac', 0.003624848635985097),
 ('memory', 0.0034840976976789903),
 ('software', 0.0034415334250699077),
 ('email', 0.0034239554442333257),
 ('pc', 0.003047105930670237)]
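Since `get_topic` returns plain `(word, score)` pairs, they are easy to post-process. The sketch below builds a short label in the same `<id>_<words>` style as the Name column; `make_label` is a hypothetical helper written for illustration, not part of BERTopic, and the pairs are the first five from the output above:

```python
# First five (word, score) pairs copied from the get_topic(0) output above.
topic_words = [
    ("windows", 0.006152228076250982),
    ("drive", 0.004982897610645755),
    ("dos", 0.004845038866360651),
    ("file", 0.004140142872194834),
    ("disk", 0.004131678774810884),
]


def make_label(topic_id, word_scores, n_words=4):
    """Hypothetical helper: join the n highest-scoring words into one label."""
    ranked = sorted(word_scores, key=lambda pair: pair[1], reverse=True)
    top_words = [word for word, _score in ranked[:n_words]]
    return "_".join([str(topic_id)] + top_words)


label = make_label(0, topic_words)
print(label)  # 0_windows_drive_dos_file
```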

Using .get_document_info, we can also extract information at the document level, such as each document's topic, its probability, and whether it is a representative document for that topic:

>>> topic_model.get_document_info(docs)

Document                               Topic    Name                            Top_n_words                     Probability    ...
I am sure some bashers of Pens...   0   0_game_team_games_season    game - team - games...          0.200010       ...
My brother is in the market for...      -1     -1_can_your_will_any         can - your - will...            0.420668       ...
Finally you said what you dream...  -1     -1_can_your_will_any         can - your - will...            0.807259       ...
Think! It's the SCSI card doing...  49     49_windows_drive_dos_file    windows - drive - dos...    0.071746       ...
1) I have an old Jasmine drive...   49     49_windows_drive_dos_file    windows - drive - dos...    0.038983       ...
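Because outliers are marked with topic -1, they can be filtered out before further analysis. The following is a minimal sketch in plain Python; the `docs` and `topics` lists are small stand-ins taken from the rows shown above, where in practice they would come from `fit_transform`:

```python
# Stand-ins for the documents and topic assignments shown in the table above.
docs = [
    "I am sure some bashers of Pens...",
    "My brother is in the market for...",
    "Finally you said what you dream...",
    "Think! It's the SCSI card doing...",
    "1) I have an old Jasmine drive...",
]
topics = [0, -1, -1, 49, 49]

OUTLIER_TOPIC = -1  # BERTopic assigns -1 to outlier documents

# Keep only documents that were assigned to a real topic.
kept = [(doc, topic) for doc, topic in zip(docs, topics) if topic != OUTLIER_TOPIC]

# Share of documents that ended up as outliers.
outlier_share = topics.count(OUTLIER_TOPIC) / len(topics)

print(len(kept), outlier_share)
```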

🔥 Tip: Use BERTopic(language="multilingual") to select a model that supports 50+ languages.

Fine-tune Topic Representations

In BERTopic, there are a number of different topic representations that we can choose from. They are all quite different from one another and offer interesting perspectives on, and variations of, the topics. A great start is KeyBERTInspired, which for many users increases the coherence of the resulting topic representations and reduces the stopwords in them:

```python
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

# Fine-tune your topic representations
representation_model = KeyBERTInspired()
topic_model = BERTopic(representation_model=representation_model)
```


However, you might want to use something more powerful to describe your clusters. You can even use ChatGPT or other models from OpenAI to generate labels, summaries, phrases, keywords, and more:

```python
import openai
from bertopic.representation import OpenAI

# Fine-tune topic representations with GPT
client = openai.OpenAI(api_key="sk-...")
representation_model = OpenAI(client, model="gpt-4o-mini", chat=True)
topic_model = BERTopic(representation_model=representation_model)
```

🔥 Tip: Instead of iterating over all of these different topic representations, you can model them simultaneously with multi-aspect topic representations in BERTopic.

Visualizations

After having trained our BERTopic model, we can iteratively go through hundreds of topics to get a good understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. Instead, we can use one of the many visualization options in BERTopic. For example, we can visualize the topics that were generated in a way very similar to LDAvis:

topic_model.visualize_topics()

Modularity

By default, topic modeling with BERTopic runs four main steps in sequence: sentence-transformers, UMAP, HDBSCAN, and c-TF-IDF. Because BERTopic assumes some independence between these steps, it is quite modular. In other words, BERTopic not only allows you to build your own topic model but also to explore several topic modeling techniques on top of your customized topic model:

https://user-images.githubusercontent.com/25746895/218420473-4b2bb539-9dbe-407a-9674-a8317c7fb3bf.mp4

You can swap out any of these models or even remove them entirely. The following steps are completely modular:

Embedding documents
Reducing dimensionality of embeddings
Clustering reduced embeddings into topics
Tokenizing topics
Weighting tokens
Representing topics with one or multiple representations

Functionality

BERTopic has many functions, which can quickly become overwhelming. To alleviate this issue, you will find an overview of each method and a short description of its purpose.

Common

Below, you will find an overview of common functions in BERTopic.

Method	Code
Fit the model	`.fit(docs)`
Fit the model and predict documents	`.fit_transform(docs)`
Predict new documents	`.transform([new_doc])`
Access single topic	`.get_topic(topic=12)`
Access all topics	`.get_topics()`
Get topic freq	`.get_topic_freq()`
Get all topic information	`.get_topic_info()`
Get all document information	`.get_document_info(docs)`
Get representative docs per topic	`.get_representative_docs()`
Update topic representation	`.update_topics(docs, n_gram_range=(1, 3))`
Generate topic labels	`.generate_topic_labels()`
Set topic labels	`.set_topic_labels(my_custom_labels)`
Merge topics	`.merge_topics(docs, topics_to_merge)`
Reduce nr of topics	`.reduce_topics(docs, nr_topics=30)`
Redu

Core symbols most depended-on inside this repo

get_topic

called by 32

bertopic/_bertopic.py

transform

called by 21

bertopic/_bertopic.py

fit

called by 20

bertopic/_bertopic.py

get_topic_info

called by 15

bertopic/_bertopic.py

_update_topic_size

called by 14

bertopic/_bertopic.py

Shape

Method 181

Function 119

Class 39

Languages

Python 100%

Modules by API surface

bertopic/_bertopic.py · 79 symbols

bertopic/_utils.py · 21 symbols

bertopic/_save_utils.py · 17 symbols

tests/conftest.py · 15 symbols

bertopic/representation/_visual.py · 9 symbols

bertopic/backend/_hftransformers.py · 8 symbols

bertopic/backend/_multimodal.py · 7 symbols

tests/test_utils.py · 6 symbols

tests/test_representation/test_representations.py · 6 symbols

bertopic/representation/_openai.py · 6 symbols

bertopic/representation/_litellm.py · 6 symbols

bertopic/representation/_keybert.py · 6 symbols

Dependencies from manifests, versioned

hdbscan 0.8.29 · 1×

numpy 1.20.0 · 1×

pandas 1.1.5 · 1×

plotly 4.7.0 · 1×

scikit-learn 1.0 · 1×

sentence-transformers 0.4.1 · 1×

tqdm 4.41.1 · 1×

umap-learn 0.5.0 · 1×
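To compare a local environment against these versions, a small standard-library check may help. This sketch assumes the listed versions are minimums, which the listing itself does not state; `version_tuple`, `at_least`, and `check` are hypothetical helpers, and the parser is deliberately naive (it ignores pre-release suffixes):

```python
from importlib import metadata

# Versions copied from the listing above; treated as minimums (an assumption).
MINIMUMS = {
    "hdbscan": "0.8.29",
    "numpy": "1.20.0",
    "pandas": "1.1.5",
    "plotly": "4.7.0",
    "scikit-learn": "1.0",
    "sentence-transformers": "0.4.1",
    "tqdm": "4.41.1",
    "umap-learn": "0.5.0",
}


def version_tuple(version):
    """Naive parser: keep the leading digits of each dot-separated part."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def at_least(installed, minimum):
    """Compare two version strings after zero-padding to equal length."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check(minimums):
    """Return a status string per package: 'ok', 'missing', or 'too old'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if at_least(installed, minimum) else "too old"
    return report


for name, status in check(MINIMUMS).items():
    print(f"{name}: {status}")
```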

For agents

$ claude mcp add BERTopic \
  -- python -m otcore.mcp_server <graph>
