hub / github.com/MaartenGr/BERTopic / OnlineCountVectorizer

Class OnlineCountVectorizer

bertopic/vectorizers/_online_cv.py:11–158

An online variant of the CountVectorizer with updating vocabulary. At each `.partial_fit`, its vocabulary is updated based on any OOV words it might find. Then, `.update_bow` can be used to track and update the Bag-of-Words representation. These functions are separated such that the vectorizer can be used in iteration without updating the Bag-of-Words representation, which might speed up the fitting process.
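The two-step pattern described above can be illustrated with a small, dependency-free sketch. `ToyOnlineCounter` is a hypothetical stand-in written for this page, not BERTopic's implementation: it uses whitespace tokenization and plain dictionaries, and only mirrors the idea that growing the vocabulary and updating the counts are separate calls.

```python
from collections import Counter


class ToyOnlineCounter:
    """Hypothetical illustration only; not the BERTopic class."""

    def __init__(self):
        self.vocabulary = {}  # token -> column index
        self.bow = []         # one Counter per row, created lazily

    def partial_fit(self, docs):
        # Step 1: only grow the vocabulary with out-of-vocabulary tokens.
        for doc in docs:
            for token in doc.lower().split():
                if token not in self.vocabulary:
                    self.vocabulary[token] = len(self.vocabulary)
        return self

    def update_bow(self, docs):
        # Step 2: add counts for known tokens, one row per document.
        for row, doc in enumerate(docs):
            if row >= len(self.bow):
                self.bow.append(Counter())
            tokens = doc.lower().split()
            self.bow[row].update(t for t in tokens if t in self.vocabulary)
        return self.bow


toy = ToyOnlineCounter()
toy.partial_fit(["cats chase mice"])
toy.partial_fit(["dogs chase cats"])       # adds only the OOV token "dogs"
bow = toy.update_bow(["dogs chase cats"])  # counts are updated separately
```

Because `partial_fit` never touches the counts, it can be called many times in a loop, and the more expensive count update can be deferred to whenever the representation is actually needed.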

Source from the content-addressed store, hash-verified

class OnlineCountVectorizer(CountVectorizer):
    """An online variant of the CountVectorizer with updating vocabulary.

    At each `.partial_fit`, its vocabulary is updated based on any OOV words
    it might find. Then, `.update_bow` can be used to track and update
    the Bag-of-Words representation. These functions are separated such that
    the vectorizer can be used in iteration without updating the Bag-of-Words
    representation, which might speed up the fitting process. However, the
    `.update_bow` function is used in BERTopic to track changes in the
    topic representations and allow for decay.

    This class inherits its parameters and attributes from:
        `sklearn.feature_extraction.text.CountVectorizer`

    Arguments:
        decay: A value between [0, 1] that sets the fraction by which the
               frequencies of the previous bag-of-words are decreased.
               For example, a value of `.1` will decrease the frequencies
               in the bag-of-words matrix by 10% at each iteration.
        delete_min_df: Delete words at each iteration from its vocabulary
                       that are below a minimum frequency.
                       This will keep the resulting bag-of-words matrix small
                       such that it does not explode in size with increasing
                       vocabulary. If `decay` is None then this equals `min_df`.
        **kwargs: Set of parameters inherited from:
                  `sklearn.feature_extraction.text.CountVectorizer`
                  In practice, this means that you can still use parameters
                  from the original CountVectorizer, like `stop_words` and
                  `ngram_range`.

    Attributes:
        X_ (scipy.sparse.csr_matrix) : The Bag-of-Words representation

    Examples:
    ```python
    from bertopic.vectorizers import OnlineCountVectorizer
    vectorizer = OnlineCountVectorizer(stop_words="english")

    for index, doc in enumerate(my_docs):
        # Both methods expect an iterable of documents, so wrap the single doc.
        vectorizer.partial_fit([doc])

        # Update and clean the bow every 100 iterations:
        if index % 100 == 0:
            X = vectorizer.update_bow([doc])
    ```

    To use the model in BERTopic:

    ```python
    from bertopic import BERTopic
    from bertopic.vectorizers import OnlineCountVectorizer

    vectorizer_model = OnlineCountVectorizer(stop_words="english")
    topic_model = BERTopic(vectorizer_model=vectorizer_model)
    ```

    References:
        Adapted from: https://github.com/idoshlomo/online_vectorizers
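To make the `decay` and `delete_min_df` arguments concrete, here is a minimal arithmetic sketch on a plain dictionary of word frequencies. The helper `decay_and_prune` is hypothetical and its ordering (decay, then add new counts, then prune) is an assumption made for illustration; the real class operates on a sparse matrix and may differ in detail.

```python
def decay_and_prune(counts, new_counts, decay=None, delete_min_df=None):
    """Toy update rule (assumed behaviour, not the library's code)."""
    counts = dict(counts)  # work on a copy so the input is left untouched

    # Scale the previous frequencies by (1 - decay); decay=0.1 keeps 90%.
    if decay is not None:
        counts = {word: freq * (1 - decay) for word, freq in counts.items()}

    # Add the frequencies observed in the new batch.
    for word, freq in new_counts.items():
        counts[word] = counts.get(word, 0) + freq

    # Drop words whose accumulated frequency is below the threshold.
    if delete_min_df is not None:
        counts = {w: f for w, f in counts.items() if f >= delete_min_df}

    return counts


previous = {"topic": 10.0, "rare": 1.0}
updated = decay_and_prune(previous, {"topic": 2}, decay=0.1, delete_min_df=1)
```

In this run the old frequency of "topic" is reduced by 10% before the new occurrences are added, while "rare" decays below the threshold and is removed, which is how the vocabulary is kept from growing without bound.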

Callers 2

online_topic_model · Function · 0.90
test_online_cv · Function · 0.90

Calls

no outgoing calls

Tested by 2

online_topic_model · Function · 0.72
test_online_cv · Function · 0.72