A class-based TF-IDF procedure using scikit-learn's TfidfTransformer as a base. c-TF-IDF can best be explained as a TF-IDF formula adapted for multiple classes by joining all documents per class. Thus, each class is converted to a single document instead of a set of documents.

class ClassTfidfTransformer(TfidfTransformer):
    """A class-based TF-IDF procedure using scikit-learn's TfidfTransformer as a base.

    c-TF-IDF can best be explained as a TF-IDF formula adapted for multiple classes
    by joining all documents per class. Thus, each class is converted to a single document
    instead of a set of documents. The frequency of each word **x** is extracted
    for each class **c** and is **l1** normalized. This constitutes the term frequency.

    Then, the term frequency is multiplied by the IDF, which is the logarithm of 1 plus
    the average number of words per class **A** divided by the frequency of word **x**
    across all classes.

    Arguments:
        bm25_weighting: Uses a BM25-inspired idf-weighting procedure instead of the procedure
                        as defined in the c-TF-IDF formula. It uses the following weighting scheme:
                        `log(1+((avg_nr_samples - df + 0.5) / (df+0.5)))`
        reduce_frequent_words: Takes the square root of the bag-of-words after normalizing the matrix.
                               Helps to reduce the impact of words that appear too frequently.
        seed_words: Specific words whose idf values will be multiplied by
                    the value of `seed_multiplier`.
                    NOTE: This only affects words that match exactly.
        seed_multiplier: The value by which the idf values of the words in `seed_words`
                         are multiplied.

    Examples:
    ```python
    transformer = ClassTfidfTransformer()
    ```
    """

    def __init__(
        self,
        bm25_weighting: bool = False,
        reduce_frequent_words: bool = False,
        seed_words: List[str] | None = None,
        seed_multiplier: float = 2,
    ):
        self.bm25_weighting = bm25_weighting
        self.reduce_frequent_words = reduce_frequent_words
        self.seed_words = seed_words
        self.seed_multiplier = seed_multiplier
        super().__init__()

    def fit(self, X: sp.csr_matrix, multiplier: np.ndarray | None = None):
        """Learn the idf vector (global term weights).

        Arguments:
            X: A matrix of term/token counts.
            multiplier: A multiplier for increasing/decreasing certain IDF scores.
        """
        # Validate the input and make sure it is stored as a sparse matrix.
        X = check_array(X, accept_sparse=("csr", "csc"))
        if not sp.issparse(X):
            X = sp.csr_matrix(X)
        dtype = np.float64

        if self.use_idf:
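The weighting described in the class docstring can be illustrated with a small dense NumPy sketch. This is not the library's implementation, only a reading of the docstring: the function name `ctfidf_sketch` and its `vocabulary` parameter are hypothetical, the average number of words per class is left unrounded, and every word is assumed to occur at least once across all classes.

```python
import numpy as np


def ctfidf_sketch(
    counts,
    bm25_weighting=False,
    reduce_frequent_words=False,
    vocabulary=None,      # hypothetical: one word per column of `counts`
    seed_words=None,
    seed_multiplier=2.0,
):
    """Illustrative c-TF-IDF scores for a (n_classes, n_words) count matrix."""
    counts = np.asarray(counts, dtype=np.float64)

    # Term frequency: l1-normalize the word counts of each class (row).
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # leave empty classes as all zeros
    tf = counts / row_sums

    # Optionally dampen very frequent words after normalization.
    if reduce_frequent_words:
        tf = np.sqrt(tf)

    # df: frequency of each word across all classes.
    # avg_nr_samples: average number of words per class (A in the docstring).
    df = counts.sum(axis=0)
    avg_nr_samples = counts.sum(axis=1).mean()

    if bm25_weighting:
        idf = np.log(1 + (avg_nr_samples - df + 0.5) / (df + 0.5))
    else:
        idf = np.log(1 + avg_nr_samples / df)

    # Multiply the idf of exactly matching seed words (assumed behavior).
    if seed_words and vocabulary is not None:
        is_seed = np.array([word in set(seed_words) for word in vocabulary])
        idf = np.where(is_seed, idf * seed_multiplier, idf)

    return tf * idf


# Two classes, three words.
example_counts = [[2, 1, 1], [0, 3, 1]]
scores = ctfidf_sketch(example_counts)
bm25_scores = ctfidf_sketch(example_counts, bm25_weighting=True)
seeded_scores = ctfidf_sketch(
    example_counts, vocabulary=["a", "b", "c"], seed_words=["b"]
)
```

Working on a dense array keeps the arithmetic visible; the class itself accepts sparse matrices, which matters for realistic vocabulary sizes.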