An online variant of the CountVectorizer with an updating vocabulary. At each `.partial_fit`, its vocabulary is updated with any out-of-vocabulary (OOV) words it finds. Then, `.update_bow` can be used to track and update the Bag-of-Words representation. These functions are separated so that the vectorizer can be used in iteration without updating the Bag-of-Words representation, which might speed up the fitting process.
| 9 | |
| 10 | |
| 11 | class OnlineCountVectorizer(CountVectorizer): |
| 12 | """An online variant of the CountVectorizer with updating vocabulary. |
| 13 | |
| 14 | At each `.partial_fit`, its vocabulary is updated based on any OOV words |
| 15 | it might find. Then, `.update_bow` can be used to track and update |
| 16 | the Bag-of-Words representation. These functions are separated such that |
| 17 | the vectorizer can be used in iteration without updating the Bag-of-Words |
| 18 | representation, which might speed up the fitting process. However, the |
| 19 | `.update_bow` function is used in BERTopic to track changes in the |
| 20 | topic representations and allow for decay. |
| 21 | |
| 22 | This class inherits its parameters and attributes from: |
| 23 | `sklearn.feature_extraction.text.CountVectorizer` |
| 24 | |
| 25 | Arguments: |
| 26 | decay: A value in [0, 1] giving the fraction by which the frequencies |
| 27 | of the previous bag-of-words are decreased. For example, |
| 28 | a value of `.1` will decrease the frequencies in the bag-of-words |
| 29 | matrix by 10% at each iteration. |
| 30 | delete_min_df: Delete words at each iteration from its vocabulary |
| 31 | that are below a minimum frequency. |
| 32 | This will keep the resulting bag-of-words matrix small |
| 33 | such that it does not explode in size with increasing |
| 34 | vocabulary. If `decay` is None then this equals `min_df`. |
| 35 | **kwargs: Set of parameters inherited from: |
| 36 | `sklearn.feature_extraction.text.CountVectorizer` |
| 37 | In practice, this means that you can still use parameters |
| 38 | from the original CountVectorizer, like `stop_words` and |
| 39 | `ngram_range`. |
| 40 | |
| 41 | Attributes: |
| 42 | X_ (scipy.sparse.csr_matrix) : The Bag-of-Words representation |
| 43 | |
| 44 | Examples: |
| 45 | ```python |
| 46 | from bertopic.vectorizers import OnlineCountVectorizer |
| 47 | vectorizer = OnlineCountVectorizer(stop_words="english") |
| 48 | |
| 49 | for index, doc in enumerate(my_docs): |
| 50 | vectorizer.partial_fit([doc]) |
| 51 | |
| 52 | # Update and clean the bow every 100 iterations: |
| 53 | if index % 100 == 0: |
| 54 | X = vectorizer.update_bow([doc]) |
| 55 | ``` |
| 56 | |
| 57 | To use the model in BERTopic: |
| 58 | |
| 59 | ```python |
| 60 | from bertopic import BERTopic |
| 61 | from bertopic.vectorizers import OnlineCountVectorizer |
| 62 | |
| 63 | vectorizer_model = OnlineCountVectorizer(stop_words="english") |
| 64 | topic_model = BERTopic(vectorizer_model=vectorizer_model) |
| 65 | ``` |
| 66 | |
| 67 | References: |
| 68 | Adapted from: https://github.com/idoshlomo/online_vectorizers |
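
The interplay of vocabulary growth, decay, and pruning described in the docstring can be illustrated with a small, dependency-free sketch. This is not the BERTopic implementation: the helper names `update_vocabulary` and `decay_and_add`, the whitespace tokenizer, and the dictionary-based counts are all assumptions made for illustration, and for brevity the sketch prunes only the counts, not the vocabulary indices.

```python
from collections import Counter


def update_vocabulary(vocabulary, tokens):
    """Give every out-of-vocabulary token the next free index (hypothetical helper)."""
    next_index = max(vocabulary.values(), default=-1) + 1
    for token in tokens:
        if token not in vocabulary:
            vocabulary[token] = next_index
            next_index += 1
    return vocabulary


def decay_and_add(counts, new_counts, decay=None, delete_min_df=None):
    """Shrink old counts, add new ones, then drop rare words (hypothetical helper)."""
    # Decay: a value of 0.1 reduces every previous frequency by 10%.
    if decay is not None:
        counts = {word: freq * (1 - decay) for word, freq in counts.items()}
    # Add the frequencies observed in the current batch.
    for word, freq in new_counts.items():
        counts[word] = counts.get(word, 0) + freq
    # Prune words whose frequency fell below the minimum.
    if delete_min_df is not None:
        counts = {word: freq for word, freq in counts.items() if freq >= delete_min_df}
    return counts


vocabulary = {}
counts = {}
batches = ["apple banana apple", "banana cherry", "cherry cherry cherry"]

for batch in batches:
    tokens = batch.split()  # naive whitespace tokenizer, for illustration only
    update_vocabulary(vocabulary, tokens)
    counts = decay_and_add(counts, Counter(tokens), decay=0.5, delete_min_df=1)

print(vocabulary)
print(counts)
```

With `decay=0.5` and `delete_min_df=1`, words that stop appearing lose half their frequency on every batch and are dropped once they fall below 1. In this run `apple` and `banana` are pruned after the third batch, leaving only `cherry` in `counts`.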