hub / github.com/ddbourgin/numpy-ml / TFIDFEncoder

Class TFIDFEncoder

numpy_ml/preprocessing/nlp.py:583–1005


583	class TFIDFEncoder:
584	def __init__(
585	self,
586	vocab=None,
587	lowercase=True,
588	min_count=0,
589	smooth_idf=True,
590	max_tokens=None,
591	input_type="files",
592	filter_stopwords=True,
593	filter_punctuation=True,
594	tokenizer="words",
595	):
596	r"""
597	An object for compiling and encoding the term-frequency
598	inverse-document-frequency (TF-IDF) representation of the tokens in a
599	text corpus.
600
601	Notes
602	-----
603	TF-IDF is intended to reflect how important a word is to a document in
604	a collection or corpus. For a word token `w` in a document `d`, and a
605	corpus, :math:`D = \{d_1, \ldots, d_N\}`, we have:
606
607	.. math::
608	\text{TF}(w, d) &= \text{num. occurrences of }w \text{ in document }d \\
609	\text{IDF}(w, D) &= \log \frac{|D|}{|\{ d \in D: w \in d \}|}
610
611	Parameters
612	----------
613	vocab : :class:`Vocabulary` object or list-like
614	An existing vocabulary to filter the tokens in the corpus against.
615	Default is None.
616	lowercase : bool
617	Whether to convert each string to lowercase before tokenization.
618	Default is True.
619	min_count : int
620	Minimum number of times a token must occur in order to be included
621	in vocab. Default is 0.
622	smooth_idf : bool
623	Whether to add 1 to the denominator of the IDF calculation to avoid
624	divide-by-zero errors. Default is True.
625	max_tokens : int
626	Only add the `max_tokens` most frequent tokens that occur more
627	than `min_count` times to the vocabulary. If None, add all tokens
628	that occur more than `min_count` times. Default is None.
629	input_type : {'files', 'strings'}
630	If 'files', the sequence input to `fit` is expected to be a list
631	of filepaths. If 'strings', the input is expected to be a list of
632	lists, each sublist containing the raw strings for a single
633	document in the corpus. Default is 'files'.
634	filter_stopwords : bool
635	Whether to remove stopwords before encoding the words in the
636	corpus. Default is True.
637	filter_punctuation : bool
638	Whether to remove punctuation before encoding the words in the
639	corpus. Default is True.
640	tokenizer : {'whitespace', 'words', 'characters', 'bytes'}
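
The TF and IDF definitions in the docstring above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the library's implementation: `tfidf_sketch` is a hypothetical helper, it takes already-tokenized documents, and its `smooth_idf` flag follows the docstring's wording literally (add 1 to the IDF denominator). The actual `TFIDFEncoder` code may compute the smoothing differently.

```python
from collections import Counter

import numpy as np


def tfidf_sketch(docs, smooth_idf=True):
    """Hypothetical helper (not part of numpy-ml): TF-IDF for tokenized docs."""
    # Sorted vocabulary gives a deterministic column order.
    vocab = sorted({w for doc in docs for w in doc})
    col = {w: j for j, w in enumerate(vocab)}

    # TF(w, d): raw number of occurrences of w in document d.
    tf = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(docs):
        for w, count in Counter(doc).items():
            tf[i, col[w]] = count

    # Document frequency: number of documents that contain w.
    df = (tf > 0).sum(axis=0)

    # IDF(w, D) = log(|D| / df); smoothing adds 1 to the denominator
    # (assumption taken from the docstring's description of `smooth_idf`).
    idf = np.log(len(docs) / (df + int(smooth_idf)))
    return vocab, tf * idf


docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"], ["the", "bird"]]
vocab, scores = tfidf_sketch(docs, smooth_idf=False)
```

With `smooth_idf=False`, a token such as "the" that appears in every document has IDF equal to log(1) = 0, so its whole column of scores is zero, while rarer tokens keep a positive weight.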

Callers 1

test_tfidf (Function) · 0.90

Calls

no outgoing calls

Tested by 1

test_tfidf (Function) · 0.72