hub / github.com/ddbourgin/numpy-ml / Vocabulary

Class Vocabulary

numpy_ml/preprocessing/nlp.py:1008–1325


class Vocabulary:
    def __init__(
        self,
        lowercase=True,
        min_count=None,
        max_tokens=None,
        filter_stopwords=True,
        filter_punctuation=True,
        tokenizer="words",
    ):
        """
        An object for compiling and encoding the unique tokens in a text corpus.

        Parameters
        ----------
        lowercase : bool
            Whether to convert each string to lowercase before tokenization.
            Default is True.
        min_count : int
            Minimum number of times a token must occur in order to be included
            in vocab. If `None`, include all tokens from `corpus_fp` in vocab.
            Default is None.
        max_tokens : int
            Only add the `max_tokens` most frequent tokens that occur more
            than `min_count` to the vocabulary. If None, add all tokens
            that occur more than `min_count`. Default is None.
        filter_stopwords : bool
            Whether to remove stopwords before encoding the words in the
            corpus. Default is True.
        filter_punctuation : bool
            Whether to remove punctuation before encoding the words in the
            corpus. Default is True.
        tokenizer : {'whitespace', 'words', 'characters', 'bytes'}
            Strategy to follow when mapping strings to tokens. The
            `'whitespace'` tokenizer splits strings at whitespace characters.
            The `'words'` tokenizer splits strings using a "word" regex. The
            `'characters'` tokenizer splits strings into individual characters.
            The `'bytes'` tokenizer splits strings into a collection of
            individual bytes.
        """
        self.hyperparameters = {
            "id": "Vocabulary",
            "encoding": None,
            "corpus_fps": None,
            "lowercase": lowercase,
            "min_count": min_count,
            "max_tokens": max_tokens,
            "filter_stopwords": filter_stopwords,
            "filter_punctuation": filter_punctuation,
            "tokenizer": tokenizer,
        }

    def __len__(self):
        """Return the number of tokens in the vocabulary"""
        return len(self._tokens)

    def __iter__(self):
        """Return an iterator over the tokens in the vocabulary"""

Callers 2

fitMethod · 0.85
fitMethod · 0.85

Calls

no outgoing calls

Tested by

no test coverage detected