def tokenize_bytes_raw(line, encoding="utf-8", splitter=None, **kwargs):
    """
    Convert the characters in `line` to a collection of bytes. Each byte is
    represented in decimal as an integer between 0 and 255.

    Parameters
    ----------
    line : str
        The string to tokenize.
    encoding : str
        The encoding scheme for the characters in `line`. Default is `'utf-8'`.
    splitter : {'punctuation', None}
        If `'punctuation'`, split the string at any punctuation character
        before encoding into bytes. If None, do not split `line` at all.
        Default is None.

    Returns
    -------
    bytes : list of str
        A list of the byte-encoded characters in `line`. Each item in the list
        is a string of space-separated integers between 0 and 255 representing
        the bytes encoding the characters in `line`. If `splitter` is None,
        the list contains a single item.
    """
    # `kwargs` is accepted but not used in this function body.
    # Encode `line`, then join the decimal value of each byte with spaces.
    byte_str = [" ".join([str(i) for i in line.encode(encoding)])]
    if splitter == "punctuation":
        # Wrap each punctuation byte matched by `_PUNC_BYTE_REGEX` in "-"
        # delimiters, then split on "-" so each one becomes its own item.
        byte_str = _PUNC_BYTE_REGEX.sub(r"-\1-", byte_str[0]).split("-")
    return byte_str
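
The following is a minimal usage sketch for `tokenize_bytes_raw`, not part of the module itself. The listing does not show how `_PUNC_BYTE_REGEX` is defined, so the pattern below is a hypothetical stand-in that matches the decimal value of any ASCII punctuation byte as a whole space-delimited token; the real pattern may differ. The function body is copied from the listing so that the sketch runs on its own.

```python
import re
import string

# ASSUMPTION: hypothetical stand-in for the module-level `_PUNC_BYTE_REGEX`,
# which is not shown in the listing. It matches the decimal value of an ASCII
# punctuation byte only when it forms a complete space-delimited token.
_PUNC_BYTE_REGEX = re.compile(
    r"(?<!\S)(" + "|".join(str(ord(c)) for c in string.punctuation) + r")(?!\S)"
)


def tokenize_bytes_raw(line, encoding="utf-8", splitter=None, **kwargs):
    # Body copied from the listing, docstring omitted for brevity.
    byte_str = [" ".join([str(i) for i in line.encode(encoding)])]
    if splitter == "punctuation":
        byte_str = _PUNC_BYTE_REGEX.sub(r"-\1-", byte_str[0]).split("-")
    return byte_str


# Without a splitter, the result is a one-item list.
print(tokenize_bytes_raw("hi, yo"))
# ['104 105 44 32 121 111']

# With splitter="punctuation", the comma (byte 44) becomes its own item.
# The neighbouring items keep the spaces that surrounded the punctuation byte.
pieces = tokenize_bytes_raw("hi, yo", splitter="punctuation")
print(pieces)
# ['104 105 ', '44', ' 32 121 111']

# Round trip: flatten the items back into integers and decode them.
flat = " ".join(pieces).split()
decoded = bytes(int(tok) for tok in flat).decode("utf-8")
print(decoded)
# hi, yo
```

Because the split items retain leading and trailing spaces, callers that need clean tokens may want to call `str.strip()` on each item and drop any empty strings.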


def bytes_to_chars(byte_list, encoding="utf-8"):