hub / github.com/MaartenGr/BERTopic / ClassTfidfTransformer

Class ClassTfidfTransformer

bertopic/vectorizers/_ctfidf.py:9–115

A class-based TF-IDF procedure using scikit-learn's TfidfTransformer as a base. c-TF-IDF can best be explained as a TF-IDF formula adapted for multiple classes by joining all documents per class. Thus, each class is converted to a single document instead of a set of documents.
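The class docstring describes the weighting as an l1-normalized term frequency per class, multiplied by the logarithm of 1 plus the average number of words per class divided by the frequency of the word across all classes. The following is a minimal pure-Python sketch of that description, not the library's sparse-matrix implementation; the function name `ctfidf_sketch` and the toy data are hypothetical and used only for illustration.

```python
import math
from collections import Counter


def ctfidf_sketch(class_docs):
    """Hypothetical helper: class_docs maps a class label to a list of tokenized documents."""
    # Join all documents of a class into a single bag of words.
    counts = {
        label: Counter(token for doc in docs for token in doc)
        for label, docs in class_docs.items()
    }
    # A: average number of words per class.
    avg_words = sum(sum(c.values()) for c in counts.values()) / len(counts)
    # Frequency of each word across all classes.
    total_freq = Counter()
    for c in counts.values():
        total_freq.update(c)

    weights = {}
    for label, c in counts.items():
        n_words = sum(c.values())
        weights[label] = {
            # l1-normalized term frequency times log(1 + A / frequency across classes)
            word: (freq / n_words) * math.log(1 + avg_words / total_freq[word])
            for word, freq in c.items()
        }
    return weights


classes = {
    "sports": [["ball", "goal", "team"], ["goal", "team"]],
    "cooking": [["salt", "pan", "team"], ["pan", "heat"]],
}
scores = ctfidf_sketch(classes)
top_sports = max(scores["sports"], key=scores["sports"].get)
top_cooking = max(scores["cooking"], key=scores["cooking"].get)
```

In this toy data "team" occurs in both classes, so its weight is lower than that of "goal", which is just as frequent within the sports class but does not occur in the other class.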

Source from the content-addressed store, hash-verified

class ClassTfidfTransformer(TfidfTransformer):
    """A Class-based TF-IDF procedure using scikit-learns TfidfTransformer as a base.

    ![](../algorithm/c-TF-IDF.svg)

    c-TF-IDF can best be explained as a TF-IDF formula adopted for multiple classes
    by joining all documents per class. Thus, each class is converted to a single document
    instead of set of documents. The frequency of each word x is extracted
    for each class c and is l1 normalized. This constitutes the term frequency.

    Then, the term frequency is multiplied with IDF which is the logarithm of 1 plus
    the average number of words per class A divided by the frequency of word x
    across all classes.

    Arguments:
        bm25_weighting: Uses BM25-inspired idf-weighting procedure instead of the procedure
                        as defined in the c-TF-IDF formula. It uses the following weighting scheme:
                        `log(1+((avg_nr_samples - df + 0.5) / (df+0.5)))`
        reduce_frequent_words: Takes the square root of the bag-of-words after normalizing the matrix.
                               Helps to reduce the impact of words that appear too frequently.
        seed_words: Specific words that will have their idf value increased by
                    the value of `seed_multiplier`.
                    NOTE: This will only increase the value of words that have an exact match.
        seed_multiplier: The value with which the idf values of the words in `seed_words`
                         are multiplied.

    Examples:
    ```python
    transformer = ClassTfidfTransformer()
    ```
    """

    def __init__(
        self,
        bm25_weighting: bool = False,
        reduce_frequent_words: bool = False,
        seed_words: List[str] | None = None,
        seed_multiplier: float = 2,
    ):
        self.bm25_weighting = bm25_weighting
        self.reduce_frequent_words = reduce_frequent_words
        self.seed_words = seed_words
        self.seed_multiplier = seed_multiplier
        super(ClassTfidfTransformer, self).__init__()

    def fit(self, X: sp.csr_matrix, multiplier: np.ndarray = None):
        """Learn the idf vector (global term weights).

        Arguments:
            X: A matrix of term/token counts.
            multiplier: A multiplier for increasing/decreasing certain IDF scores
        """
        X = check_array(X, accept_sparse=("csr", "csc"))
        if not sp.issparse(X):
            X = sp.csr_matrix(X)
        dtype = np.float64

        if self.use_idf:
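The optional arguments documented in the docstring above can be sketched in plain Python as follows. This is an illustration of the documented formulas only, assuming scalar inputs; the helper names `idf_weight`, `adjust_tf`, and `apply_seed` are hypothetical and do not exist in the library, which applies the equivalent operations to sparse matrices inside the class.

```python
import math


def idf_weight(df, avg_nr_samples, bm25_weighting=False):
    """Hypothetical helper: idf for one word with frequency `df` across all classes."""
    if bm25_weighting:
        # BM25-inspired scheme quoted in the docstring.
        return math.log(1 + ((avg_nr_samples - df + 0.5) / (df + 0.5)))
    # Default scheme described in the docstring: log(1 + A / frequency).
    return math.log(1 + avg_nr_samples / df)


def adjust_tf(tf, reduce_frequent_words=False):
    """Hypothetical helper: square root of an already normalized term frequency."""
    return math.sqrt(tf) if reduce_frequent_words else tf


def apply_seed(idf_by_word, seed_words=None, seed_multiplier=2):
    """Hypothetical helper: multiply idf values of exactly matching seed words."""
    seeds = set(seed_words or [])
    return {
        word: value * seed_multiplier if word in seeds else value
        for word, value in idf_by_word.items()
    }


avg_nr_samples = 5
default_idf = idf_weight(2, avg_nr_samples)
bm25_idf = idf_weight(2, avg_nr_samples, bm25_weighting=True)
damped_tf = adjust_tf(0.25, reduce_frequent_words=True)
# Only the exact match "goal" is boosted; "Goal" is left unchanged.
boosted = apply_seed({"goal": 1.0, "Goal": 1.0}, seed_words=["goal"], seed_multiplier=2)
```

With the class itself, these behaviours are selected through the constructor arguments shown in the listing, for example `ClassTfidfTransformer(bm25_weighting=True, reduce_frequent_words=True)`.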

Callers 4

test_ctfidf · Function · 0.90

test_ctfidf_custom_cv · Function · 0.90

__init__ · Method · 0.90

update_topics · Method · 0.90

Calls

no outgoing calls

Tested by 2

test_ctfidf · Function · 0.72

test_ctfidf_custom_cv · Function · 0.72