hub / github.com/ddbourgin/numpy-ml / tokenize_bytes_raw

Function tokenize_bytes_raw

numpy_ml/preprocessing/nlp.py:112–138

Convert the characters in `line` to a collection of bytes. Each byte is represented in decimal as an integer between 0 and 255.

(line, encoding="utf-8", splitter=None, **kwargs)
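As a minimal sketch of the default path (`splitter=None`), the snippet below re-implements the encoding step using only the standard library, so it runs without numpy-ml installed. The helper name `to_byte_tokens` is hypothetical and is not part of the library.

```python
def to_byte_tokens(line, encoding="utf-8"):
    """Hypothetical stand-in for the splitter=None path of tokenize_bytes_raw."""
    # Iterating over a bytes object yields ints in the range 0..255.
    return [" ".join(str(b) for b in line.encode(encoding))]


ascii_tokens = to_byte_tokens("hi")  # one byte per ASCII character
accent_tokens = to_byte_tokens("\u00e9")  # "e" with acute accent: two bytes in UTF-8
latin1_tokens = to_byte_tokens("\u00e9", encoding="latin-1")  # one byte in Latin-1

print(ascii_tokens)   # ['104 105']
print(accent_tokens)  # ['195 169']
print(latin1_tokens)  # ['233']
```

The result is always a one-element list on this path, and the number of integers in the string depends on the chosen `encoding`, not on the number of characters in `line`.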

Source

def tokenize_bytes_raw(line, encoding="utf-8", splitter=None, **kwargs):
    """
    Convert the characters in `line` to a collection of bytes. Each byte is
    represented in decimal as an integer between 0 and 255.

    Parameters
    ----------
    line : str
        The string to tokenize.
    encoding : str
        The encoding scheme for the characters in `line`. Default is `'utf-8'`.
    splitter : {'punctuation', None}
        If `'punctuation'`, split the string at any punctuation character
        before encoding into bytes. If None, do not split `line` at all.
        Default is None.

    Returns
    -------
    bytes : list
        A list of the byte-encoded characters in `line`. Each item in the list
        is a string of space-separated integers between 0 and 255 representing
        the bytes encoding the characters in `line`.
    """
    byte_str = [" ".join([str(i) for i in line.encode(encoding)])]
    if splitter == "punctuation":
        byte_str = _PUNC_BYTE_REGEX.sub(r"-\1-", byte_str[0]).split("-")
    return byte_str
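The `splitter="punctuation"` branch can be illustrated with the sketch below. The definition of `_PUNC_BYTE_REGEX` is not shown in this listing, so `PUNC_BYTE_REGEX` here is an assumed stand-in built from `string.punctuation`. The word boundaries in the pattern are also an assumption, added so that a value such as `33` is not matched inside `133`. The function name `split_on_punctuation` is likewise hypothetical.

```python
import re
import string

# ASSUMPTION: stand-in for the module-level _PUNC_BYTE_REGEX, whose real
# definition is not shown. It matches the decimal value of any ASCII
# punctuation character as a whole space-delimited token.
PUNC_BYTE_REGEX = re.compile(
    r"\b(" + "|".join(str(ord(c)) for c in string.punctuation) + r")\b"
)


def split_on_punctuation(line, encoding="utf-8"):
    """Hypothetical sketch of the splitter='punctuation' branch."""
    byte_str = " ".join(str(b) for b in line.encode(encoding))
    # Wrap each punctuation byte in "-" markers, then split on the marker.
    return PUNC_BYTE_REGEX.sub(r"-\1-", byte_str).split("-")


pieces = split_on_punctuation("hi, yo")
print(pieces)  # ['104 105 ', '44', ' 32 121 111']
```

In this sketch the pieces keep the spaces that surrounded the punctuation byte, and a line ending in punctuation produces a trailing empty string, because the same substitute-then-split approach as the listing above is used.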

Callers 1

_transform (method) · 0.85

Calls 1

encode (method) · 0.80

Tested by

no test coverage detected