hub / github.com/MaartenGr/BERTopic / partial_fit

Method partial_fit

bertopic/_bertopic.py:649–795

Fit BERTopic on a subset of the data and perform online learning with batch-like data. Online topic modeling in BERTopic is performed by using dimensionality reduction and cluster algorithms that support a `partial_fit` method in order to incrementally train the topic model.
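As a rough illustration of what supporting a `partial_fit` method means in practice, the sketch below pushes synthetic vectors through scikit-learn's `IncrementalPCA` and `MiniBatchKMeans` one batch at a time. The random data, the batch size of 100, and `n_clusters=4` are assumptions made for the illustration; it does not reproduce BERTopic's internal calls.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
# Synthetic stand-in for document embeddings: 300 vectors of dimension 16.
embeddings = rng.normal(size=(300, 16))

reducer = IncrementalPCA(n_components=5)
clusterer = MiniBatchKMeans(n_clusters=4, random_state=0, n_init=3)

batch_labels = []
for start in range(0, len(embeddings), 100):
    batch = embeddings[start:start + 100]
    reducer.partial_fit(batch)              # update the projection with this batch
    reduced = reducer.transform(batch)      # project the batch to 5 dimensions
    clusterer.partial_fit(reduced)          # update the cluster centers with this batch
    batch_labels.append(clusterer.predict(reduced))

labels = np.concatenate(batch_labels)
```

Neither model ever sees the full dataset at once, which is the property online topic modeling relies on.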

(
    self,
    documents: List[str],
    embeddings: np.ndarray = None,
    y: Union[List[int], np.ndarray] = None,
)
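Because `documents` and the optional `embeddings` are passed together, each batch has to slice both in step. The helper below is a minimal sketch of that bookkeeping; `iter_batches` and the default batch size of 1000 are hypothetical and not part of BERTopic.

```python
def iter_batches(documents, embeddings=None, batch_size=1000):
    """Yield (documents, embeddings) slices that stay aligned with each other."""
    if embeddings is not None and len(embeddings) != len(documents):
        raise ValueError("documents and embeddings must have the same length")
    for start in range(0, len(documents), batch_size):
        stop = start + batch_size
        batch_embeddings = None if embeddings is None else embeddings[start:stop]
        yield documents[start:stop], batch_embeddings


# Intended use, assuming `topic_model` is a BERTopic instance built with
# sub-models that support online learning:
#
# for batch_docs, batch_emb in iter_batches(docs, embeddings):
#     topic_model.partial_fit(batch_docs, embeddings=batch_emb)
```

Slicing works the same way for Python lists and NumPy arrays, so the helper accepts either for `embeddings`.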

Source from the content-addressed store, hash-verified

def partial_fit(
    self,
    documents: List[str],
    embeddings: np.ndarray = None,
    y: Union[List[int], np.ndarray] = None,
):
    """Fit BERTopic on a subset of the data and perform online learning
    with batch-like data.

    Online topic modeling in BERTopic is performed by using dimensionality
    reduction and cluster algorithms that support a `partial_fit` method
    in order to incrementally train the topic model.

    Likewise, the `bertopic.vectorizers.OnlineCountVectorizer` is used
    to dynamically update its vocabulary when presented with new data.
    It has several parameters for modeling decay and updating the
    representations.

    In other words, although the main algorithm stays the same, the training
    procedure now works as follows:

    For each subset of the data:

    1. Generate embeddings with a pre-trained language model
    2. Incrementally update the dimensionality reduction algorithm with `partial_fit`
    3. Incrementally update the cluster algorithm with `partial_fit`
    4. Incrementally update the OnlineCountVectorizer and apply some form of decay

    Note that it is advised to use `partial_fit` with batches and
    not single documents for the best performance.

    Arguments:
        documents: A list of documents to fit on
        embeddings: Pre-trained document embeddings. These can be used
                    instead of the sentence-transformer model
        y: The target class for (semi)-supervised modeling. Use -1 if no class for a
           specific instance is specified.

    Examples:
    ```python
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.decomposition import IncrementalPCA
    from bertopic.vectorizers import OnlineCountVectorizer
    from bertopic import BERTopic

    # Prepare documents
    docs = fetch_20newsgroups(subset=subset, remove=('headers', 'footers', 'quotes'))["data"]

    # Prepare sub-models that support online learning
    umap_model = IncrementalPCA(n_components=5)
    cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
    vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=.01)

    topic_model = BERTopic(umap_model=umap_model,
                           hdbscan_model=cluster_model,
                           vectorizer_model=vectorizer_model)
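Step 4 of the docstring mentions applying some form of decay when the vectorizer is updated. The following is a minimal sketch of one plausible scheme, assuming existing term counts are scaled by `1 - decay` before the counts from the new batch are added. `update_counts` is a hypothetical helper, not part of BERTopic, and the actual `OnlineCountVectorizer` may behave differently.

```python
from collections import Counter


def update_counts(running, batch_tokens, decay):
    """Shrink existing counts by `decay`, then add counts from the new batch."""
    # Assumption: decay is multiplicative, so older evidence fades gradually.
    updated = {term: count * (1 - decay) for term, count in running.items()}
    for term, count in Counter(batch_tokens).items():
        updated[term] = updated.get(term, 0.0) + count
    return updated


counts = {}
counts = update_counts(counts, ["gpu", "driver", "gpu"], decay=0.5)
counts = update_counts(counts, ["driver", "kernel"], decay=0.5)
```

With a scheme like this, terms that stop appearing in new batches lose weight over time, while terms that keep appearing stay prominent.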

Callers 4

online_topic_model · Function · 0.95

_reduce_dimensionality · Method · 0.45

_cluster_embeddings · Method · 0.45

_c_tf_idf · Method · 0.45

Calls 13

_extract_embeddings · Method · 0.95

_guided_topic_modeling · Method · 0.95

_reduce_dimensionality · Method · 0.95

_cluster_embeddings · Method · 0.95

_c_tf_idf · Method · 0.95

_extract_words_per_topic · Method · 0.95

_create_topic_vectors · Method · 0.95

_update_topic_size · Method · 0.95

check_embeddings_shape · Function · 0.90

select_backend · Function · 0.90

TopicMapper · Class · 0.85

get_mappings · Method · 0.80

Tested by 1

online_topic_model · Function · 0.76