Merge multiple pre-trained BERTopic models into a single model. The models are merged as if they were all saved using pytorch or safetensors, so a minimal version without c-TF-IDF. To do this, we choose the first model in the list of models as a baseline. Then, we c
(cls, models, min_similarity: float = 0.7, embedding_model=None)
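The merge rule described in the docstring below (compare each later model's topic embeddings with the baseline's, and keep only the topics that are sufficiently dissimilar) can be illustrated with a small, dependency-free sketch. This is not BERTopic's implementation: the helper names `cosine_similarity` and `split_topics` are hypothetical, the embeddings are toy 2-D vectors, and the sketch assumes non-zero vectors and that a topic counts as "new" when its best similarity to any baseline topic is below `min_similarity`.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def split_topics(baseline, candidates, min_similarity=0.7):
    """Split candidate topics into matched and new ones (illustrative only).

    Returns a dict mapping candidate index -> closest baseline index for
    topics at or above the threshold, and a list of candidate indices
    that fall below it and would be added as new topics.
    """
    matched = {}
    new_topics = []
    for cand_id, cand_vec in enumerate(candidates):
        sims = [cosine_similarity(cand_vec, base_vec) for base_vec in baseline]
        best_id = max(range(len(sims)), key=sims.__getitem__)
        if sims[best_id] >= min_similarity:
            matched[cand_id] = best_id      # similar: map onto the baseline topic
        else:
            new_topics.append(cand_id)      # dissimilar: treat as a new topic
    return matched, new_topics


# Toy embeddings: the first candidate is close to baseline topic 0,
# the second points away from both baseline topics.
baseline = [[1.0, 0.0], [0.0, 1.0]]
candidates = [[0.9, 0.1], [-1.0, 0.0]]
matched, new_topics = split_topics(baseline, candidates)
```

Raising `min_similarity` in this sketch makes matching stricter, so more candidate topics end up in `new_topics`; lowering it folds more of them into existing baseline topics.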
| 3588 | |
| 3589 | @classmethod |
| 3590 | def merge_models(cls, models, min_similarity: float = 0.7, embedding_model=None): |
| 3591 | """Merge multiple pre-trained BERTopic models into a single model. |
| 3592 | |
| 3593 | The models are merged as if they were all saved using pytorch or |
| 3594 | safetensors, which yields a minimal model without c-TF-IDF. |
| 3595 | |
| 3596 | To do this, we choose the first model in the list of |
| 3597 | models as a baseline. Then, we check whether each remaining model |
| 3598 | contains topics that are not in the baseline. |
| 3599 | This check is based on the cosine similarity between |
| 3600 | topic embeddings. If a topic's embedding in a later model is |
| 3601 | similar to one in the baseline, that topic is mapped onto the |
| 3602 | matching baseline topic. If it is dissimilar, the topic is |
| 3603 | added to the baseline as a new topic. |
| 3604 | |
| 3605 | In essence, we simply check whether sufficiently "new" |
| 3606 | topics emerge and add them. |
| 3607 | |
| 3608 | Arguments: |
| 3609 | models: A list of fitted BERTopic models |
| 3610 | min_similarity: The minimum cosine similarity at which a topic is merged into a baseline topic; topics below it are added as new topics. |
| 3611 | embedding_model: Additionally load in an embedding model if necessary. |
| 3612 | |
| 3613 | Returns: |
| 3614 | A new BERTopic model that was created as if you were |
| 3615 | loading a model from the HuggingFace Hub without c-TF-IDF |
| 3616 | |
| 3617 | Examples: |
| 3618 | ```python |
| 3619 | from bertopic import BERTopic |
| 3620 | from sklearn.datasets import fetch_20newsgroups |
| 3621 | |
| 3622 | docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data'] |
| 3623 | |
| 3624 | # Create three separate models |
| 3625 | topic_model_1 = BERTopic(min_topic_size=5).fit(docs[:4000]) |
| 3626 | topic_model_2 = BERTopic(min_topic_size=5).fit(docs[4000:8000]) |
| 3627 | topic_model_3 = BERTopic(min_topic_size=5).fit(docs[8000:]) |
| 3628 | |
| 3629 | # Combine all models into one |
| 3630 | merged_model = BERTopic.merge_models([topic_model_1, topic_model_2, topic_model_3]) |
| 3631 | ``` |
| 3632 | """ |
| 3633 | |
| 3634 | def choose_backend(): |
| 3635 | """Choose the backend to use for saving the model.""" |
| 3636 | try: |
| 3637 | import torch # noqa: F401 |
| 3638 | |
| 3639 | return "pytorch" |
| 3640 | except (ModuleNotFoundError, ImportError): |
| 3641 | try: |
| 3642 | import safetensors # noqa: F401 |
| 3643 | |
| 3644 | return "safetensors" |
| 3645 | except (ModuleNotFoundError, ImportError): |
| 3646 | raise ImportError( |
| 3647 | "Neither pytorch nor safetensors is installed. " |