hub / github.com/MaartenGr/BERTopic / ClassTfidfTransformer

Class ClassTfidfTransformer

bertopic/vectorizers/_ctfidf.py:9–115

A class-based TF-IDF procedure using scikit-learn's TfidfTransformer as a base. c-TF-IDF can best be explained as a TF-IDF formula adapted for multiple classes by joining all documents per class. Thus, each class is converted to a single document instead of a set of documents.
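
The "join all documents per class" step can be illustrated with a minimal pure-Python sketch. It is not the library's code: `class_term_frequencies` is a hypothetical helper, and whitespace tokenization is an assumption made only to keep the example self-contained.

```python
from collections import Counter


def class_term_frequencies(docs, labels):
    """Join documents per class and return l1-normalized word frequencies.

    Hypothetical helper for illustration; tokenization is a plain
    whitespace split, which is an assumption of this sketch.
    """
    # Merge the tokens of every document that shares a class label,
    # so each class behaves like one long document.
    joined = {}
    for doc, label in zip(docs, labels):
        joined.setdefault(label, []).extend(doc.lower().split())

    # Count words per class and divide by the class total (l1 normalization).
    tf = {}
    for label, words in joined.items():
        counts = Counter(words)
        total = sum(counts.values())
        tf[label] = {word: n / total for word, n in counts.items()}
    return tf


docs = ["cats purr", "cats sleep", "stocks rise"]
labels = [0, 0, 1]
tf = class_term_frequencies(docs, labels)
print(tf[0]["cats"])  # 0.5, since "cats" is 2 of the 4 words in class 0
```

Each class's frequencies sum to 1, which is what makes classes of very different sizes comparable before the IDF weighting is applied.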

Source from the content-addressed store, hash-verified

class ClassTfidfTransformer(TfidfTransformer):
    """A Class-based TF-IDF procedure using scikit-learns TfidfTransformer as a base.

    ![](../algorithm/c-TF-IDF.svg)

    c-TF-IDF can best be explained as a TF-IDF formula adopted for multiple classes
    by joining all documents per class. Thus, each class is converted to a single document
    instead of set of documents. The frequency of each word **x** is extracted
    for each class **c** and is **l1** normalized. This constitutes the term frequency.

    Then, the term frequency is multiplied with IDF which is the logarithm of 1 plus
    the average number of words per class **A** divided by the frequency of word **x**
    across all classes.

    Arguments:
        bm25_weighting: Uses BM25-inspired idf-weighting procedure instead of the procedure
                        as defined in the c-TF-IDF formula. It uses the following weighting scheme:
                        `log(1+((avg_nr_samples - df + 0.5) / (df+0.5)))`
        reduce_frequent_words: Takes the square root of the bag-of-words after normalizing the matrix.
                               Helps to reduce the impact of words that appear too frequently.
        seed_words: Specific words that will have their idf value increased by
                    the value of `seed_multiplier`.
                    NOTE: This will only increase the value of words that have an exact match.
        seed_multiplier: The value with which the idf values of the words in `seed_words`
                         are multiplied.

    Examples:
    ```python
    transformer = ClassTfidfTransformer()
    ```
    """

    def __init__(
        self,
        bm25_weighting: bool = False,
        reduce_frequent_words: bool = False,
        seed_words: List[str] | None = None,
        seed_multiplier: float = 2,
    ):
        self.bm25_weighting = bm25_weighting
        self.reduce_frequent_words = reduce_frequent_words
        self.seed_words = seed_words
        self.seed_multiplier = seed_multiplier
        super(ClassTfidfTransformer, self).__init__()

    def fit(self, X: sp.csr_matrix, multiplier: np.ndarray = None):
        """Learn the idf vector (global term weights).

        Arguments:
            X: A matrix of term/token counts.
            multiplier: A multiplier for increasing/decreasing certain IDF scores
        """
        X = check_array(X, accept_sparse=("csr", "csc"))
        if not sp.issparse(X):
            X = sp.csr_matrix(X)
        dtype = np.float64

        if self.use_idf:
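
The IDF weighting options described in the docstring can be sketched in pure Python. This is an illustration built only from the docstring's formulas, not the body of `fit`, which operates on sparse matrices and may differ in detail. `idf_weights` and its `class_counts` argument are hypothetical names introduced for this example.

```python
import math


def idf_weights(class_counts, bm25_weighting=False, seed_words=None, seed_multiplier=2.0):
    """Compute per-word IDF values following the docstring's description.

    class_counts: one dict per class mapping word -> count (hypothetical
    input format chosen for this sketch).
    """
    # df: frequency of each word summed across all classes.
    df = {}
    for counts in class_counts:
        for word, n in counts.items():
            df[word] = df.get(word, 0) + n

    # A: average number of words per class.
    avg_nr_samples = sum(df.values()) / len(class_counts)

    idf = {}
    for word, freq in df.items():
        if bm25_weighting:
            # BM25-inspired scheme quoted in the docstring.
            idf[word] = math.log(1 + (avg_nr_samples - freq + 0.5) / (freq + 0.5))
        else:
            # Default scheme: log(1 + A / frequency of the word).
            idf[word] = math.log(1 + avg_nr_samples / freq)

    # Seed words: multiply the IDF of exact matches only.
    for word in seed_words or []:
        if word in idf:
            idf[word] *= seed_multiplier
    return idf


class_counts = [{"cats": 2, "purr": 1, "sleep": 1}, {"stocks": 1, "rise": 1}]
idf = idf_weights(class_counts)
bm25 = idf_weights(class_counts, bm25_weighting=True)
seeded = idf_weights(class_counts, seed_words=["cats"], seed_multiplier=2.0)
```

With six words over two classes, the average class size is 3, so the default IDF of "cats" is `log(1 + 3/2)` while the BM25-inspired variant gives `log(1 + 1.5/2.5)`. Seeding "cats" doubles its default IDF and leaves every other word unchanged.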

Callers 4

test_ctfidf · Function · 0.90
test_ctfidf_custom_cv · Function · 0.90
__init__ · Method · 0.90
update_topics · Method · 0.90

Calls

no outgoing calls

Tested by 2

test_ctfidf · Function · 0.72
test_ctfidf_custom_cv · Function · 0.72