hub / github.com/ddbourgin/numpy-ml / tokenize_words

Function tokenize_words

numpy_ml/preprocessing/nlp.py:77–86

Split a string into individual words, optionally removing punctuation and stop-words in the process.

(
    line, lowercase=True, filter_stopwords=True, filter_punctuation=True, **kwargs,
)

Source from the content-addressed store, hash-verified

def tokenize_words(
    line, lowercase=True, filter_stopwords=True, filter_punctuation=True, **kwargs,
):
    """
    Split a string into individual words, optionally removing punctuation and
    stop-words in the process.
    """
    REGEX = _WORD_REGEX if filter_punctuation else _WORD_REGEX_W_PUNC
    words = REGEX.findall(line.lower() if lowercase else line)
    return remove_stop_words(words) if filter_stopwords else words
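
The three flags in the listing above act independently: `filter_punctuation` selects the regex, `lowercase` controls whether the input is lowercased before matching, and `filter_stopwords` decides whether the result is passed through `remove_stop_words`. The `**kwargs` are accepted but not used in the body. The sketch below mirrors that control flow so the behavior can be tried without the library installed. Everything suffixed `_sketch`, the two regex patterns, and the tiny `STOP_WORDS` set are hypothetical stand-ins chosen for illustration. They are not the library's actual `_WORD_REGEX`, `_WORD_REGEX_W_PUNC`, or stop-word list, so real output may differ.

```python
import re

# Hypothetical stand-ins (assumptions, not numpy-ml's real definitions).
WORD_REGEX = re.compile(r"\b\w\w+\b")          # words of 2+ characters, no punctuation
WORD_REGEX_W_PUNC = re.compile(r"\w+|[^\w\s]")  # words, plus each punctuation mark
STOP_WORDS = {"the", "a", "an", "and", "of", "is"}


def remove_stop_words_sketch(words):
    """Drop tokens found in the illustrative stop-word set (case-insensitive)."""
    return [w for w in words if w.lower() not in STOP_WORDS]


def tokenize_words_sketch(
    line, lowercase=True, filter_stopwords=True, filter_punctuation=True
):
    """Same control flow as the listing: pick regex, match, then filter."""
    regex = WORD_REGEX if filter_punctuation else WORD_REGEX_W_PUNC
    words = regex.findall(line.lower() if lowercase else line)
    return remove_stop_words_sketch(words) if filter_stopwords else words


text = "The cat, and the Hat!"

# Defaults: lowercased, punctuation dropped, stop-words removed.
default_tokens = tokenize_words_sketch(text)

# All filters off: original case kept, punctuation returned as separate tokens.
raw_tokens = tokenize_words_sketch(
    text, lowercase=False, filter_stopwords=False, filter_punctuation=False
)

print(default_tokens)
print(raw_tokens)
```

Under these assumed patterns, the default call keeps only the content words, while the all-off call returns every word and punctuation mark in its original case.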

Callers 5

_train · Method · 0.90
minibatcher · Method · 0.85
tokenize_words_bytes · Function · 0.85
train · Method · 0.85
train · Method · 0.85

Calls 1

remove_stop_words · Function · 0.85

Tested by 2

train · Method · 0.68
train · Method · 0.68