Fit BERTopic on a subset of the data and perform online learning with batch-like data. Online topic modeling in BERTopic is performed by using dimensionality reduction and cluster algorithms that support a `partial_fit` method in order to incrementally train the topic model.
`partial_fit(self, documents: List[str], embeddings: np.ndarray = None, y: Union[List[int], np.ndarray] = None)`
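To make the incremental pattern concrete, the sketch below runs the dimensionality reduction and clustering steps on three batches using scikit-learn's `IncrementalPCA` and `MiniBatchKMeans`, both of which expose `partial_fit`. It is only an illustration of the idea, not BERTopic's internal implementation: the random vectors stand in for real document embeddings, and the batch size, the cluster count, and the names `reducer`, `clusterer`, and `batch_labels` are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA

# Assumption: random vectors replace embeddings from a language model.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(600, 32))

# Sub-models that support incremental training via `partial_fit`.
reducer = IncrementalPCA(n_components=5)
clusterer = MiniBatchKMeans(n_clusters=4, random_state=0)

batch_labels = []
for start in range(0, len(embeddings), 200):
    batch = embeddings[start:start + 200]

    # Update the dimensionality reduction model, then project this batch.
    reducer.partial_fit(batch)
    reduced = reducer.transform(batch)

    # Update the cluster model, then assign this batch to clusters.
    clusterer.partial_fit(reduced)
    batch_labels.append(clusterer.predict(reduced))

labels = np.concatenate(batch_labels)
print(labels.shape, clusterer.cluster_centers_.shape)
```

Each batch must contain at least as many samples as `n_components` and `n_clusters`, which is one practical reason to feed batches rather than single documents.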
| 647 | return predictions, probabilities |
| 648 | |
| 649 | def partial_fit( |
| 650 | self, |
| 651 | documents: List[str], |
| 652 | embeddings: np.ndarray = None, |
| 653 | y: Union[List[int], np.ndarray] = None, |
| 654 | ): |
| 655 | """Fit BERTopic on a subset of the data and perform online learning |
| 656 | with batch-like data. |
| 657 | |
| 658 | Online topic modeling in BERTopic is performed by using dimensionality |
| 659 | reduction and cluster algorithms that support a `partial_fit` method |
| 660 | in order to incrementally train the topic model. |
| 661 | |
| 662 | Likewise, the `bertopic.vectorizers.OnlineCountVectorizer` is used |
| 663 | to dynamically update its vocabulary when presented with new data. |
| 664 | It has several parameters for modeling decay and updating the |
| 665 | representations. |
| 666 | |
| 667 | In other words, although the main algorithm stays the same, the training |
| 668 | procedure now works as follows: |
| 669 | |
| 670 | For each subset of the data: |
| 671 | |
| 672 | 1. Generate embeddings with a pre-trained language model |
| 673 | 2. Incrementally update the dimensionality reduction algorithm with `partial_fit` |
| 674 | 3. Incrementally update the cluster algorithm with `partial_fit` |
| 675 | 4. Incrementally update the OnlineCountVectorizer and apply decay to previously learned word frequencies |
| 676 | |
| 677 | For the best performance, call `partial_fit` with batches of |
| 678 | documents rather than with single documents. |
| 679 | |
| 680 | Arguments: |
| 681 | documents: A list of documents to fit on |
| 682 | embeddings: Pre-trained document embeddings. These can be used |
| 683 | instead of the sentence-transformer model |
| 684 | y: The target class for (semi)-supervised modeling. Use -1 if no class for a |
| 685 | specific instance is specified. |
| 686 | |
| 687 | Examples: |
| 688 | ```python |
| 689 | from sklearn.datasets import fetch_20newsgroups |
| 690 | from sklearn.cluster import MiniBatchKMeans |
| 691 | from sklearn.decomposition import IncrementalPCA |
| 692 | from bertopic.vectorizers import OnlineCountVectorizer |
| 693 | from bertopic import BERTopic |
| 694 | |
| 695 | # Prepare documents |
| 696 | docs = fetch_20newsgroups(subset="all", remove=('headers', 'footers', 'quotes'))["data"] |
| 697 | |
| 698 | # Prepare sub-models that support online learning |
| 699 | umap_model = IncrementalPCA(n_components=5) |
| 700 | cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0) |
| 701 | vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=.01) |
| 702 | |
| 703 | topic_model = BERTopic(umap_model=umap_model, |
| 704 | hdbscan_model=cluster_model, |
| 705 | vectorizer_model=vectorizer_model) |
| 706 |