hub / github.com/MaartenGr/BERTopic / fit_transform

Method fit_transform

bertopic/_bertopic.py:395–543

Fit the models on a collection of documents, generate topics, and return the probabilities and topic per document.

(
        self,
        documents: List[str],
        embeddings: np.ndarray = None,
        images: List[str] | None = None,
        y: Union[List[int], np.ndarray] = None,
    )
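The `y` argument takes one integer class per document, with -1 marking documents that have no class. A minimal sketch of building such a list from partially labeled data follows; the helper `encode_partial_labels` and the category names are hypothetical and not part of BERTopic.

```python
from typing import List, Optional


def encode_partial_labels(labels: List[Optional[str]]) -> List[int]:
    """Map string labels to integer classes; unlabeled entries (None) become -1.

    Hypothetical helper for illustration, not part of BERTopic.
    """
    class_ids = {}
    encoded = []
    for label in labels:
        if label is None:
            encoded.append(-1)  # -1 means "no class specified" for this document
        else:
            # Assign ids in order of first appearance: 0, 1, 2, ...
            encoded.append(class_ids.setdefault(label, len(class_ids)))
    return encoded


# Made-up labels for four documents; two of them are unlabeled.
raw_labels = ["sports", None, "politics", None]
y = encode_partial_labels(raw_labels)

# y would then be passed alongside the documents, e.g.:
# topics, probs = topic_model.fit_transform(docs, y=y)
```

The list must line up with `documents` position by position, so any filtering or shuffling of the documents has to be applied to `y` as well.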

Source from the content-addressed store, hash-verified

    def fit_transform(
        self,
        documents: List[str],
        embeddings: np.ndarray = None,
        images: List[str] | None = None,
        y: Union[List[int], np.ndarray] = None,
    ) -> Tuple[List[int], Union[np.ndarray, None]]:
        """Fit the models on a collection of documents, generate topics,
        and return the probabilities and topic per document.

        Arguments:
            documents: A list of documents to fit on
            embeddings: Pre-trained document embeddings. These can be used
                        instead of the sentence-transformer model
            images: A list of paths to the images to fit on or the images themselves
            y: The target class for (semi)-supervised modeling. Use -1 if no class for a
               specific instance is specified.

        Returns:
            predictions: Topic predictions for each documents
            probabilities: The probability of the assigned topic per document.
                           If `calculate_probabilities` in BERTopic is set to True, then
                           it calculates the probabilities of all topics across all documents
                           instead of only the assigned topic. This, however, slows down
                           computation and may increase memory usage.

        Examples:
        ```python
        from bertopic import BERTopic
        from sklearn.datasets import fetch_20newsgroups

        docs = fetch_20newsgroups(subset='all')['data']
        topic_model = BERTopic()
        topics, probs = topic_model.fit_transform(docs)
        ```

        If you want to use your own embeddings, use it as follows:

        ```python
        from bertopic import BERTopic
        from sklearn.datasets import fetch_20newsgroups
        from sentence_transformers import SentenceTransformer

        # Create embeddings
        docs = fetch_20newsgroups(subset='all')['data']
        sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
        embeddings = sentence_model.encode(docs, show_progress_bar=True)

        # Create topic model
        topic_model = BERTopic()
        topics, probs = topic_model.fit_transform(docs, embeddings)
        ```
        """
        if documents is not None:
            check_documents_type(documents)
            check_embeddings_shape(embeddings, documents)

        doc_ids = range(len(documents)) if documents is not None else range(len(images))
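The opening lines of the body validate the inputs and then pick which collection supplies the document ids. A rough stand-in for that guard logic is sketched below. It assumes a hypothetical helper name and plain Python sequences in place of `np.ndarray`; BERTopic's own `check_documents_type` and `check_embeddings_shape` may apply different checks and raise different errors.

```python
from typing import List, Optional, Sequence


def validate_inputs(
    documents: Optional[List[str]],
    embeddings: Optional[Sequence[Sequence[float]]] = None,
    images: Optional[List[str]] = None,
) -> range:
    """Hypothetical guard: check the inputs, then return the document ids."""
    if documents is not None:
        # Every document must be a string.
        if not all(isinstance(doc, str) for doc in documents):
            raise TypeError("documents must be a list of strings")
        # Precomputed embeddings need exactly one row per document.
        if embeddings is not None and len(embeddings) != len(documents):
            raise ValueError("embeddings and documents differ in length")
        return range(len(documents))
    if images is None:
        # Assumption of this sketch: fail with an explicit message when
        # neither documents nor images are given.
        raise ValueError("provide documents or images")
    # Image-only input: ids are taken from the images instead.
    return range(len(images))


doc_ids = validate_inputs(["first doc", "second doc"], [[0.1, 0.2], [0.3, 0.4]])
image_ids = validate_inputs(None, images=["a.png", "b.png", "c.png"])
```

Checking the row count before any model runs means a mismatch between `documents` and precomputed `embeddings` surfaces immediately rather than deep inside the pipeline.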

Callers 7

fit · Method · 0.95

reduced_embeddings · Function · 0.80

_reduce_dimensionality · Method · 0.80

_c_tf_idf · Method · 0.80

visualize_topics · Function · 0.80

embed · Method · 0.80

Calls 15

_extract_embeddings · Method · 0.95

_guided_topic_modeling · Method · 0.95

_reduce_dimensionality · Method · 0.95

_is_zeroshot · Method · 0.95

_zeroshot_topic_modeling · Method · 0.95

_cluster_embeddings · Method · 0.95

_combine_zeroshot_topics · Method · 0.95

_sort_mappings_by_frequency · Method · 0.95

_images_to_text · Method · 0.95

_extract_topics · Method · 0.95

_reduce_topics · Method · 0.95

_create_topic_vectors · Method · 0.95

Tested by 1

reduced_embeddings · Function · 0.64