hub / github.com/ddbourgin/numpy-ml / tokenize_bytes_raw

Function tokenize_bytes_raw

numpy_ml/preprocessing/nlp.py:112–138

Convert the characters in `line` to a collection of bytes. Each byte is represented in decimal as an integer between 0 and 255.

(line, encoding="utf-8", splitter=None, **kwargs)

Source from the content-addressed store, hash-verified

def tokenize_bytes_raw(line, encoding="utf-8", splitter=None, **kwargs):
    """
    Convert the characters in `line` to a collection of bytes. Each byte is
    represented in decimal as an integer between 0 and 255.

    Parameters
    ----------
    line : str
        The string to tokenize.
    encoding : str
        The encoding scheme for the characters in `line`. Default is `'utf-8'`.
    splitter : {'punctuation', None}
        If `'punctuation'`, split the string at any punctuation character
        before encoding into bytes. If None, do not split `line` at all.
        Default is None.

    Returns
    -------
    bytes : list
        A list of the byte-encoded characters in `line`. Each item in the list
        is a string of space-separated integers between 0 and 255 representing
        the bytes encoding the characters in `line`.
    """
    byte_str = [" ".join([str(i) for i in line.encode(encoding)])]
    if splitter == "punctuation":
        byte_str = _PUNC_BYTE_REGEX.sub(r"-\1-", byte_str[0]).split("-")
    return byte_str
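
The listing relies on `_PUNC_BYTE_REGEX`, which is defined elsewhere in the module and is not shown here. The sketch below is a standalone illustration of the same idea, not the library's code: `bytes_as_decimal_string` mirrors the `splitter=None` path, while `decimal_string_to_text`, `PUNC_REGEX`, and `split_at_punctuation` are hypothetical names introduced for this example. In particular, `PUNC_REGEX` is only an assumed stand-in for `_PUNC_BYTE_REGEX`, and the final strip-and-filter step is an addition that the source does not perform.

```python
import re
import string


def bytes_as_decimal_string(line, encoding="utf-8"):
    """Mirror the splitter=None path: one string of space-separated byte values."""
    return [" ".join(str(b) for b in line.encode(encoding))]


def decimal_string_to_text(byte_str, encoding="utf-8"):
    """Hypothetical inverse helper: parse the decimal values and decode them."""
    return bytes(int(tok) for tok in byte_str.split()).decode(encoding)


# Assumed stand-in for _PUNC_BYTE_REGEX (its real definition is not shown):
# match the decimal value of any ASCII punctuation byte as a whole token.
PUNC_VALUES = sorted(str(ord(c)) for c in string.punctuation)
PUNC_REGEX = re.compile(r"\b(" + "|".join(PUNC_VALUES) + r")\b")


def split_at_punctuation(byte_str):
    """Use the same sub-then-split step as the listing, then tidy the pieces."""
    pieces = PUNC_REGEX.sub(r"-\1-", byte_str).split("-")
    # Extra step not present in the source: trim spaces and drop empty strings.
    return [p.strip() for p in pieces if p.strip()]


tokens = bytes_as_decimal_string("héllo")
print(tokens)                             # ['104 195 169 108 108 111']
print(decimal_string_to_text(tokens[0]))  # héllo

punct_tokens = split_at_punctuation(bytes_as_decimal_string("hi, yo!")[0])
print(punct_tokens)                       # ['104 105', '44', '32 121 111', '33']
```

Note that "é" occupies two bytes (195 169) under UTF-8, so the number of integers can exceed the number of characters. Splitting on "-" is safe here because the encoded string contains only digits and spaces.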

Callers 1

_transform (method) · 0.85

Calls 1

encode (method) · 0.80

Tested by

no test coverage detected