hub / github.com/MaartenGr/BERTopic / partial_fit

Method partial_fit

bertopic/_bertopic.py:649–795

Fit BERTopic on a subset of the data and perform online learning with batch-like data. Online topic modeling in BERTopic relies on dimensionality reduction and clustering algorithms that support a `partial_fit` method, which allows the topic model to be trained incrementally.
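A minimal sketch of feeding data to `partial_fit` in batches rather than one document at a time. The helpers `iter_batches` and `fit_online` are hypothetical names introduced for illustration; `topic_model` is assumed to be a BERTopic instance whose sub-models support `partial_fit`, although any object with a `partial_fit(documents)` method works here.

```python
from typing import Iterator, List, Sequence


def iter_batches(docs: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `docs` with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(docs), batch_size):
        yield list(docs[start:start + batch_size])


def fit_online(topic_model, docs: Sequence[str], batch_size: int = 1000):
    """Call `partial_fit` once per batch (hypothetical helper, not part of BERTopic)."""
    for batch in iter_batches(docs, batch_size):
        # Assumption: the model accepts a list of documents as its first argument,
        # as in the signature shown on this page.
        topic_model.partial_fit(batch)
    return topic_model
```

With a configured model this might be called as `fit_online(topic_model, docs, batch_size=1000)`. Some sub-models may reject very small batches, so it can be worth merging a short final batch into the previous one.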

partial_fit(
    self,
    documents: List[str],
    embeddings: np.ndarray = None,
    y: Union[List[int], np.ndarray] = None,
)
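The `y` argument takes integer classes, with -1 for documents that have no class. As a rough illustration, the hypothetical helper `encode_targets` below (not part of BERTopic) turns optional string labels into such a list.

```python
from typing import Dict, List, Optional, Sequence


def encode_targets(labels: Sequence[Optional[str]]) -> List[int]:
    """Map optional string labels to integer classes, using -1 for missing ones."""
    class_ids: Dict[str, int] = {}
    encoded: List[int] = []
    for label in labels:
        if label is None:
            encoded.append(-1)  # no class known for this document
        else:
            # Assign ids in order of first appearance: 0, 1, 2, ...
            encoded.append(class_ids.setdefault(label, len(class_ids)))
    return encoded


y = encode_targets(["sports", None, "politics", "sports"])
print(y)  # [0, -1, 1, 0]
```

This sketch builds a fresh mapping on every call. When targets are supplied batch by batch, the label-to-id mapping would need to be shared across batches so that a class keeps the same integer throughout.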

Source from the content-addressed store, hash-verified

        return predictions, probabilities

    def partial_fit(
        self,
        documents: List[str],
        embeddings: np.ndarray = None,
        y: Union[List[int], np.ndarray] = None,
    ):
        """Fit BERTopic on a subset of the data and perform online learning
        with batch-like data.

        Online topic modeling in BERTopic is performed by using dimensionality
        reduction and cluster algorithms that support a `partial_fit` method
        in order to incrementally train the topic model.

        Likewise, the `bertopic.vectorizers.OnlineCountVectorizer` is used
        to dynamically update its vocabulary when presented with new data.
        It has several parameters for modeling decay and updating the
        representations.

        In other words, although the main algorithm stays the same, the training
        procedure now works as follows:

        For each subset of the data:

        1. Generate embeddings with a pre-trained language model
        2. Incrementally update the dimensionality reduction algorithm with `partial_fit`
        3. Incrementally update the cluster algorithm with `partial_fit`
        4. Incrementally update the OnlineCountVectorizer and apply some form of decay

        Note that it is advised to use `partial_fit` with batches and
        not single documents for the best performance.

        Arguments:
            documents: A list of documents to fit on
            embeddings: Pre-trained document embeddings. These can be used
                        instead of the sentence-transformer model
            y: The target class for (semi)-supervised modeling. Use -1 if no class for a
               specific instance is specified.

        Examples:
        ```python
        from sklearn.datasets import fetch_20newsgroups
        from sklearn.cluster import MiniBatchKMeans
        from sklearn.decomposition import IncrementalPCA
        from bertopic.vectorizers import OnlineCountVectorizer
        from bertopic import BERTopic

        # Prepare documents
        docs = fetch_20newsgroups(subset=subset, remove=('headers', 'footers', 'quotes'))["data"]

        # Prepare sub-models that support online learning
        umap_model = IncrementalPCA(n_components=5)
        cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
        vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=.01)

        topic_model = BERTopic(umap_model=umap_model,
                               hdbscan_model=cluster_model,
                               vectorizer_model=vectorizer_model)

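Step 4 of the docstring mentions applying "some form of decay" to the vectorizer. The following is a simplified, pure-Python illustration of that idea, not the `OnlineCountVectorizer` implementation. It assumes that previously accumulated term counts are multiplied by `1 - decay` before the counts of each new batch are added, and it uses naive whitespace tokenization. The class name `DecayingCounts` is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


class DecayingCounts:
    """Toy bag-of-words that down-weights old counts on every update."""

    def __init__(self, decay: float = 0.01) -> None:
        if not 0.0 <= decay <= 1.0:
            raise ValueError("decay must lie in [0, 1]")
        self.decay = decay
        self.counts: Dict[str, float] = {}

    def update(self, documents: Iterable[str]) -> None:
        # Assumption: older evidence shrinks by a factor of (1 - decay) per batch.
        for term in self.counts:
            self.counts[term] *= 1.0 - self.decay
        # Count the terms of the new batch and add them to the running totals.
        batch = Counter(tok for doc in documents for tok in doc.lower().split())
        for term, n in batch.items():
            self.counts[term] = self.counts.get(term, 0.0) + n


bow = DecayingCounts(decay=0.5)
bow.update(["cat dog"])
bow.update(["cat"])
print(bow.counts)  # {'cat': 1.5, 'dog': 0.5}
```

Under this assumption, terms that stop appearing in new batches fade gradually, while terms that keep appearing stay prominent. A small value such as the `decay=.01` used in the docstring example would make that fading slow.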

Callers 4

online_topic_model · Function · 0.95
_cluster_embeddings · Method · 0.45
_c_tf_idf · Method · 0.45

Calls 13

_extract_embeddings · Method · 0.95
_cluster_embeddings · Method · 0.95
_c_tf_idf · Method · 0.95
_create_topic_vectors · Method · 0.95
_update_topic_size · Method · 0.95
check_embeddings_shape · Function · 0.90
select_backend · Function · 0.90
TopicMapper · Class · 0.85
get_mappings · Method · 0.80

Tested by 1

online_topic_model · Function · 0.76