Compute the (smoothed) inverse document frequency (IDF) for each token in the corpus.
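To make the two formulas in the `_calc_idf` docstring concrete, here is a minimal standalone sketch on a toy corpus. It is not the class's implementation: the helper name `toy_idf` and the example documents are assumptions made for illustration, and it uses only the standard library rather than NumPy.

```python
import math


def toy_idf(docs, smooth_idf=True):
    """Return {word: IDF} for a list of tokenized documents.

    Hypothetical helper for illustration only; it mirrors the formulas in the
    `_calc_idf` docstring, not the class's internal data structures.
    """
    # |D|, plus one pseudo-document containing every word when smoothing
    n_docs = len(docs) + int(smooth_idf)
    vocab = {w for d in docs for w in d}

    idf = {}
    for w in sorted(vocab):
        # Number of documents containing w (plus the pseudo-document)
        d_count = int(smooth_idf) + sum(1 for d in docs if w in d)
        idf[w] = math.log(n_docs / d_count) + 1
    return idf


# Assumed toy corpus: three short, already-tokenized documents
docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
plain = toy_idf(docs, smooth_idf=False)
smoothed = toy_idf(docs, smooth_idf=True)
```

In this corpus, a word that appears in every document (`the`) gets an IDF of 1 in both variants, while smoothing lowers the IDF of the rare word `dog` from log(3) + 1 to log(2) + 1.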
        self.vocab_counts = Counter({t.word: t.count for t in self._tokens})

    def _calc_idf(self):
        """
        Compute the (smoothed) inverse document frequency (IDF) for each token
        in the corpus.

        For a word token `w`, the IDF is

            IDF(w) = log( |D| / |{d in D : w in d}| ) + 1

        where D is the set of all documents in the corpus,

            D = {d_1, d_2, ..., d_|D|}

        If `smooth_idf` is True, we perform additive smoothing on the number of
        documents containing a given word, equivalent to pretending that there
        exists an additional (|D| + 1)-th document that contains every word in
        the corpus:

            SmoothedIDF(w) = log( (|D| + 1) / (1 + |{d in D : w in d}|) ) + 1
        """
        inv_doc_freq = {}
        smooth_idf = self.hyperparameters["smooth_idf"]
        tf, doc_idxs = self.term_freq, self._idx2doc.keys()

        # Number of documents, plus one pseudo-document when smoothing
        D = len(self._idx2doc) + int(smooth_idf)
        for word, w_ix in self.token2idx.items():
            # Number of documents containing the word (plus the pseudo-document)
            d_count = int(smooth_idf)
            d_count += np.sum([1 if w_ix in tf[d_ix] else 0 for d_ix in doc_idxs])
            # A word that appears in no document gets a default IDF of 1
            inv_doc_freq[w_ix] = 1 if d_count == 0 else np.log(D / d_count) + 1
        self.inv_doc_freq = inv_doc_freq

    def transform(self, ignore_special_chars=True):
        """