

class TFIDFEncoder:
    def __init__(
        self,
        vocab=None,
        lowercase=True,
        min_count=0,
        smooth_idf=True,
        max_tokens=None,
        input_type="files",
        filter_stopwords=True,
        filter_punctuation=True,
        tokenizer="words",
    ):
        r"""
        An object for compiling and encoding the term-frequency
        inverse-document-frequency (TF-IDF) representation of the tokens in a
        text corpus.

        Notes
        -----
        TF-IDF is intended to reflect how important a word is to a document in
        a collection or corpus. For a word token `w` in a document `d` and a
        corpus :math:`D = \{d_1, \ldots, d_N\}`, we have:

        .. math::
            \text{TF}(w, d) &= \text{num. occurrences of }w \text{ in document }d \\
            \text{IDF}(w, D) &= \log \frac{|D|}{|\{ d \in D: w \in d \}|}

        Parameters
        ----------
        vocab : :class:`Vocabulary` object or list-like
            An existing vocabulary to filter the tokens in the corpus against.
            Default is None.
        lowercase : bool
            Whether to convert each string to lowercase before tokenization.
            Default is True.
        min_count : int
            Minimum number of times a token must occur in order to be included
            in vocab. Default is 0.
        smooth_idf : bool
            Whether to add 1 to the denominator of the IDF calculation to avoid
            divide-by-zero errors. Default is True.
        max_tokens : int
            Only add the `max_tokens` most frequent tokens that occur more
            than `min_count` times to the vocabulary. If None, add all tokens
            that occur more than `min_count` times. Default is None.
        input_type : {'files', 'strings'}
            If 'files', the sequence input to `fit` is expected to be a list
            of filepaths. If 'strings', the input is expected to be a list of
            lists, each sublist containing the raw strings for a single
            document in the corpus. Default is 'files'.
        filter_stopwords : bool
            Whether to remove stopwords before encoding the words in the
            corpus. Default is True.
        filter_punctuation : bool
            Whether to remove punctuation before encoding the words in the
            corpus. Default is True.
        tokenizer : {'whitespace', 'words', 'characters', 'bytes'}