hub / github.com/ddbourgin/numpy-ml / TFIDFEncoder

Class TFIDFEncoder

numpy_ml/preprocessing/nlp.py:583–1005


class TFIDFEncoder:
    def __init__(
        self,
        vocab=None,
        lowercase=True,
        min_count=0,
        smooth_idf=True,
        max_tokens=None,
        input_type="files",
        filter_stopwords=True,
        filter_punctuation=True,
        tokenizer="words",
    ):
        r"""
        An object for compiling and encoding the term-frequency
        inverse-document-frequency (TF-IDF) representation of the tokens in a
        text corpus.

        Notes
        -----
        TF-IDF is intended to reflect how important a word is to a document in
        a collection or corpus. For a word token `w` in a document `d`, and a
        corpus, :math:`D = \{d_1, \ldots, d_N\}`, we have:

        .. math::
            \text{TF}(w, d) &= \text{num. occurrences of }w \text{ in document }d \\
            \text{IDF}(w, D) &= \log \frac{|D|}{|\{ d \in D: w \in d \}|}

        Parameters
        ----------
        vocab : :class:`Vocabulary` object or list-like
            An existing vocabulary to filter the tokens in the corpus against.
            Default is None.
        lowercase : bool
            Whether to convert each string to lowercase before tokenization.
            Default is True.
        min_count : int
            Minimum number of times a token must occur in order to be included
            in vocab. Default is 0.
        smooth_idf : bool
            Whether to add 1 to the denominator of the IDF calculation to avoid
            divide-by-zero errors. Default is True.
        max_tokens : int
            Only add the `max_tokens` most frequent tokens that occur more
            than `min_count` times to the vocabulary. If None, add all tokens
            that occur more than `min_count` times. Default is None.
        input_type : {'files', 'strings'}
            If 'files', the sequence input to `fit` is expected to be a list
            of filepaths. If 'strings', the input is expected to be a list of
            lists, each sublist containing the raw strings for a single
            document in the corpus. Default is 'files'.
        filter_stopwords : bool
            Whether to remove stopwords before encoding the words in the
            corpus. Default is True.
        filter_punctuation : bool
            Whether to remove punctuation before encoding the words in the
            corpus. Default is True.
        tokenizer : {'whitespace', 'words', 'characters', 'bytes'}
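
The TF and IDF definitions in the Notes of the docstring above can be illustrated with a small NumPy sketch. This is not the `TFIDFEncoder` implementation: the helper name `tfidf_sketch` is hypothetical, documents are assumed to be pre-tokenized lists of strings, and smoothing is modeled simply as adding 1 to the document-frequency denominator, following the `smooth_idf` description. The library's actual computation may differ.

```python
import numpy as np


def tfidf_sketch(docs, smooth_idf=True):
    """Hypothetical helper: TF-IDF for a list of pre-tokenized documents.

    Returns the sorted vocabulary and a (num_docs, num_tokens) matrix whose
    entries are TF(w, d) * IDF(w, D), using the formulas from the docstring.
    """
    # Build a sorted vocabulary from every token seen in the corpus.
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}

    # TF(w, d): raw count of token w in document d.
    tf = np.zeros((len(docs), len(vocab)))
    for row, doc in enumerate(docs):
        for w in doc:
            tf[row, index[w]] += 1

    # Document frequency: number of documents that contain each token.
    df = (tf > 0).sum(axis=0)

    # Assumption: smoothing adds 1 to the denominator only.
    denom = df + 1 if smooth_idf else df
    idf = np.log(len(docs) / denom)

    # Broadcasting scales each column of tf by that token's IDF.
    return vocab, tf * idf


# Example corpus of two tokenized documents.
corpus = [["a", "b", "a"], ["b", "c"]]
vocab, matrix = tfidf_sketch(corpus, smooth_idf=False)
print(vocab)
print(matrix)
```

Without smoothing, a token that appears in every document gets an IDF of log(1) = 0, so its column is zeroed out. With the denominator-only smoothing modeled here, the same token gets a slightly negative weight, which is one reason practical implementations often adjust the formula further.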

Callers 1

test_tfidf · Function · 0.90

Calls

no outgoing calls

Tested by 1

test_tfidf · Function · 0.72