hub / github.com/MaartenGr/BERTopic / fit_transform

Method fit_transform

bertopic/_bertopic.py:395–543

Fit the models on a collection of documents, generate topics, and return the probabilities and topic per document.
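
The method returns a pair: one topic id per document and, per the return annotation in the source (`Tuple[List[int], Union[np.ndarray, None]]`), a probabilities object that may be `None`. The following is a minimal sketch of consuming that pair defensively. It uses stand-in values rather than a fitted model, `summarize_assignments` is a hypothetical helper that is not part of BERTopic, and it assumes the one-probability-per-document case (that is, `calculate_probabilities` left off).

```python
from collections import Counter
from typing import List, Optional, Sequence


def summarize_assignments(
    topics: List[int],
    probs: Optional[Sequence[float]],
) -> dict:
    """Hypothetical helper: summarize the (topics, probs) pair from fit_transform."""
    summary = {
        "n_docs": len(topics),
        # Number of documents assigned to each topic id.
        "topic_sizes": dict(Counter(topics)),
        "mean_prob": None,
    }
    # The annotated return type allows probs to be None, so guard before use.
    if probs is not None and len(probs) > 0:
        summary["mean_prob"] = sum(float(p) for p in probs) / len(probs)
    return summary


# Stand-in values shaped like a fit_transform result (not real model output).
topics = [0, 1, 0, 2]
probs = [0.9, 0.8, 0.7, 0.6]

with_probs = summarize_assignments(topics, probs)
without_probs = summarize_assignments(topics, None)
```

Checking for `None` before touching the probabilities keeps downstream code working regardless of how the model was configured.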

(
        self,
        documents: List[str],
        embeddings: np.ndarray = None,
        images: List[str] | None = None,
        y: Union[List[int], np.ndarray] = None,
    )
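
The `y` argument enables (semi-)supervised modeling: per the docstring, it holds one target class per document, with -1 marking documents that have no class. The following is a minimal sketch of building such a list from partially labeled data. `encode_partial_labels` and the sample labels are hypothetical, and the mapping from label names to integers is an arbitrary choice made here for illustration.

```python
from typing import Dict, List, Optional, Tuple


def encode_partial_labels(
    labels: List[Optional[str]],
) -> Tuple[List[int], Dict[str, int]]:
    """Hypothetical helper: map label names to ints, using -1 for unlabeled docs."""
    mapping: Dict[str, int] = {}
    y: List[int] = []
    for label in labels:
        if label is None:
            # No class known for this document.
            y.append(-1)
            continue
        # Assign the next free integer the first time a label is seen.
        if label not in mapping:
            mapping[label] = len(mapping)
        y.append(mapping[label])
    return y, mapping


docs = ["goal in extra time", "new GPU released", "weather was mild", "striker transferred"]
labels = ["sports", "tech", None, "sports"]

y, mapping = encode_partial_labels(labels)

# With BERTopic installed, the list would then be passed as the `y` argument:
# topics, probs = BERTopic().fit_transform(docs, y=y)
```

The toy list only illustrates the shape of `y`; a real corpus needs far more documents than this before a model can be fitted.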

Source from the content-addressed store, hash-verified

        return self

    def fit_transform(
        self,
        documents: List[str],
        embeddings: np.ndarray = None,
        images: List[str] | None = None,
        y: Union[List[int], np.ndarray] = None,
    ) -> Tuple[List[int], Union[np.ndarray, None]]:
        """Fit the models on a collection of documents, generate topics,
        and return the probabilities and topic per document.

        Arguments:
            documents: A list of documents to fit on
            embeddings: Pre-trained document embeddings. These can be used
                        instead of the sentence-transformer model
            images: A list of paths to the images to fit on or the images themselves
            y: The target class for (semi)-supervised modeling. Use -1 if no class for a
               specific instance is specified.

        Returns:
            predictions: Topic predictions for each documents
            probabilities: The probability of the assigned topic per document.
                           If `calculate_probabilities` in BERTopic is set to True, then
                           it calculates the probabilities of all topics across all documents
                           instead of only the assigned topic. This, however, slows down
                           computation and may increase memory usage.

        Examples:
        ```python
        from bertopic import BERTopic
        from sklearn.datasets import fetch_20newsgroups

        docs = fetch_20newsgroups(subset='all')['data']
        topic_model = BERTopic()
        topics, probs = topic_model.fit_transform(docs)
        ```

        If you want to use your own embeddings, use it as follows:

        ```python
        from bertopic import BERTopic
        from sklearn.datasets import fetch_20newsgroups
        from sentence_transformers import SentenceTransformer

        # Create embeddings
        docs = fetch_20newsgroups(subset='all')['data']
        sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
        embeddings = sentence_model.encode(docs, show_progress_bar=True)

        # Create topic model
        topic_model = BERTopic()
        topics, probs = topic_model.fit_transform(docs, embeddings)
        ```
        """
        if documents is not None:
            check_documents_type(documents)
            check_embeddings_shape(embeddings, documents)

        doc_ids = range(len(documents)) if documents is not None else range(len(images))
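
The opening lines of the body validate the inputs and then index the corpus by documents when they are given and by images otherwise. A rough, dependency-free sketch of that flow follows. The two check functions are simplified stand-ins written for this illustration; they are not the library's `check_documents_type` and `check_embeddings_shape`, whose exact behavior is not shown in the excerpt. `make_doc_ids` is likewise a hypothetical name.

```python
from typing import List, Optional, Sequence


def check_documents_sketch(documents) -> None:
    """Stand-in check (assumed behavior): require a non-string iterable of str."""
    if isinstance(documents, str) or not all(isinstance(d, str) for d in documents):
        raise TypeError("documents must be an iterable containing only strings")


def check_embeddings_sketch(embeddings: Optional[Sequence], documents: Sequence) -> None:
    """Stand-in check (assumed behavior): one embedding row per document."""
    if embeddings is not None and len(embeddings) != len(documents):
        raise ValueError("embeddings and documents differ in length")


def make_doc_ids(
    documents: Optional[List[str]],
    images: Optional[List[str]],
    embeddings: Optional[Sequence] = None,
) -> range:
    # Mirrors the excerpt: checks run only when documents are provided.
    if documents is not None:
        check_documents_sketch(documents)
        check_embeddings_sketch(embeddings, documents)
    # Same fallback as the excerpt: index by documents, else by images.
    return range(len(documents)) if documents is not None else range(len(images))


text_ids = make_doc_ids(["first doc", "second doc"], images=None)
image_ids = make_doc_ids(None, images=["a.png", "b.png", "c.png"])
```

As in the excerpt, the fallback expression assumes at least one of `documents` or `images` is provided; with both set to `None`, `len(images)` raises a `TypeError`.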

Callers 7

fit · Method · 0.95
reduced_embeddings · Function · 0.80
_c_tf_idf · Method · 0.80
visualize_topics · Function · 0.80
embed · Method · 0.80
embed · Method · 0.80

Calls 15

_extract_embeddings · Method · 0.95
_is_zeroshot · Method · 0.95
_cluster_embeddings · Method · 0.95
_images_to_text · Method · 0.95
_extract_topics · Method · 0.95
_reduce_topics · Method · 0.95
_create_topic_vectors · Method · 0.95

Tested by 1

reduced_embeddings · Function · 0.64