Fit the models on a collection of documents, generate topics, and return the probabilities and topic per document. Arguments: documents: A list of documents to fit on embeddings: Pre-trained document embeddings. These can be used inste
(
self,
documents: List[str],
embeddings: np.ndarray = None,
images: List[str] | None = None,
y: Union[List[int], np.ndarray] = None,
)
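The `y` argument expects one integer target per document, with `-1` marking documents that have no class. The following is a minimal sketch of how such a list might be built from partially labeled data. The toy documents, the label names, and the `encode_targets` helper are hypothetical and are not part of BERTopic.

```python
from typing import List, Optional

# Hypothetical toy data: one optional label per document (None = unlabeled).
documents: List[str] = [
    "The team won the championship game.",
    "New GPU architecture announced today.",
    "An untitled note with no known category.",
    "The striker scored twice in the final.",
]
labels: List[Optional[str]] = ["sports", "tech", None, "sports"]


def encode_targets(labels: List[Optional[str]]) -> List[int]:
    """Map string labels to integer classes, using -1 for unlabeled documents."""
    # Sort the known label names so the integer ids are deterministic.
    classes = sorted({label for label in labels if label is not None})
    class_to_id = {name: idx for idx, name in enumerate(classes)}
    return [class_to_id[label] if label is not None else -1 for label in labels]


y = encode_targets(labels)
assert len(y) == len(documents)  # one target per document
print(y)  # [0, 1, -1, 0]
```

A list like this would then be passed as `fit_transform(documents, y=y)`. The four documents here only illustrate the encoding and are far too few to fit a real topic model.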
        return self

    def fit_transform(
        self,
        documents: List[str],
        embeddings: np.ndarray = None,
        images: List[str] | None = None,
        y: Union[List[int], np.ndarray] = None,
    ) -> Tuple[List[int], Union[np.ndarray, None]]:
        """Fit the models on a collection of documents, generate topics,
        and return the probabilities and topic per document.

        Arguments:
            documents: A list of documents to fit on
            embeddings: Pre-trained document embeddings. These can be used
                        instead of the sentence-transformer model
            images: A list of paths to the images to fit on or the images themselves
            y: The target class for (semi)-supervised modeling. Use -1 if no class for a
               specific instance is specified.

        Returns:
            predictions: Topic predictions for each document
            probabilities: The probability of the assigned topic per document.
                           If `calculate_probabilities` in BERTopic is set to True, then
                           it calculates the probabilities of all topics across all documents
                           instead of only the assigned topic. This, however, slows down
                           computation and may increase memory usage.

        Examples:
        ```python
        from bertopic import BERTopic
        from sklearn.datasets import fetch_20newsgroups

        docs = fetch_20newsgroups(subset='all')['data']
        topic_model = BERTopic()
        topics, probs = topic_model.fit_transform(docs)
        ```

        If you want to use your own embeddings, use them as follows:

        ```python
        from bertopic import BERTopic
        from sklearn.datasets import fetch_20newsgroups
        from sentence_transformers import SentenceTransformer

        # Create embeddings
        docs = fetch_20newsgroups(subset='all')['data']
        sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
        embeddings = sentence_model.encode(docs, show_progress_bar=True)

        # Create topic model
        topic_model = BERTopic()
        topics, probs = topic_model.fit_transform(docs, embeddings)
        ```
        """
        if documents is not None:
            check_documents_type(documents)
            check_embeddings_shape(embeddings, documents)

        doc_ids = range(len(documents)) if documents is not None else range(len(images))