hub / github.com/openai/tiktoken / decode_with_offsets

Method decode_with_offsets

tiktoken/core.py:312–335

Decodes a list of tokens into a string and a list of offsets. Each offset is the index into text corresponding to the start of each token. If UTF-8 character boundaries do not line up with token boundaries, the offset is the index of the first character that contains bytes from the token.

(self, tokens: Sequence[int])
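
A common use of the returned offsets is recovering the substring each token covers. The following is a minimal sketch, not part of tiktoken: `spans_from_offsets` is a hypothetical helper, and the input values are copied from the method's doctest rather than produced by a live encoding.

```python
from typing import Sequence


def spans_from_offsets(text: str, offsets: Sequence[int]) -> list[str]:
    """Slice `text` between consecutive offsets (hypothetical helper)."""
    # The end of each span is the start of the next one; the last span
    # runs to the end of the text.
    ends = list(offsets[1:]) + [len(text)]
    return [text[start:end] for start, end in zip(offsets, ends)]


# Values taken from the doctest: enc.decode_with_offsets([31373, 995])
text, offsets = "hello world", [0, 5]
print(spans_from_offsets(text, offsets))  # ['hello', ' world']
```

When one character is split across several tokens, those tokens share the same offset, so this slicing yields an empty span for all but the last of them. Whether that is acceptable depends on the caller.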

Source from the content-addressed store, hash-verified

310         return [self.decode_single_token_bytes(token) for token in tokens]
311
312     def decode_with_offsets(self, tokens: Sequence[int]) -> tuple[str, list[int]]:
313         """Decodes a list of tokens into a string and a list of offsets.
314
315         Each offset is the index into text corresponding to the start of each token.
316         If UTF-8 character boundaries do not line up with token boundaries, the offset is the index
317         of the first character that contains bytes from the token.
318
319         This will currently raise if given tokens that decode to invalid UTF-8; this behaviour may
320         change in the future to be more permissive.
321
322         >>> enc.decode_with_offsets([31373, 995])
323         ('hello world', [0, 5])
324         """
325         token_bytes = self.decode_tokens_bytes(tokens)
326
327         text_len = 0
328         offsets = []
329         for token in token_bytes:
330             offsets.append(max(0, text_len - (0x80 <= token[0] < 0xC0)))
331             text_len += sum(1 for c in token if not 0x80 <= c < 0xC0)
332
333         # TODO: assess correctness for errors="ignore" and errors="replace"
334         text = b"".join(token_bytes).decode("utf-8", errors="strict")
335         return text, offsets
336
337     def decode_batch(
338         self, batch: Sequence[Sequence[int]], *, errors: str = "replace", num_threads: int = 8
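
The offset loop in the listing above can be exercised without loading an encoding. The sketch below is a standalone mirror of lines 327–335 that takes the per-token byte strings directly. The names `is_continuation` and `offsets_from_token_bytes` are hypothetical, and the byte pieces are hand-made for illustration rather than real tokens from any vocabulary. Like the original loop, it assumes every token is non-empty.

```python
from typing import Sequence


def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes (0b10xxxxxx, i.e. 0x80..0xBF)."""
    return 0x80 <= byte < 0xC0


def offsets_from_token_bytes(token_bytes: Sequence[bytes]) -> tuple[str, list[int]]:
    """Hypothetical standalone mirror of the offset loop in the listing."""
    text_len = 0  # number of characters started so far
    offsets: list[int] = []
    for token in token_bytes:
        # A token that starts mid-character belongs to a character that
        # was already counted, so step back by one (never below zero).
        starts_mid_char = is_continuation(token[0])
        offsets.append(max(0, text_len - int(starts_mid_char)))
        # Every non-continuation byte starts exactly one character.
        text_len += sum(1 for b in token if not is_continuation(b))
    # Strict decoding: invalid UTF-8 raises UnicodeDecodeError.
    text = b"".join(token_bytes).decode("utf-8", errors="strict")
    return text, offsets


# "é" is the two bytes C3 A9; here it is split across two pieces.
pieces = [b"h", b"\xc3", b"\xa9", b"!"]
text, offsets = offsets_from_token_bytes(pieces)
print(text, offsets)  # hé! [0, 1, 1, 2]

# A dangling lead byte is invalid UTF-8, so the strict decode raises.
try:
    offsets_from_token_bytes([b"h", b"\xc3"])
    raised = False
except UnicodeDecodeError:
    raised = True
print(raised)  # True
```

Both pieces of the split "é" report offset 1, which matches the docstring's rule that the offset is the index of the first character containing bytes from the token.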

Callers 2

test_hyp_offsets · Function · 0.80
test_basic_offsets · Function · 0.80

Calls 2

decode_tokens_bytes · Method · 0.95
decode · Method · 0.45

Tested by 2

test_hyp_offsets · Function · 0.64
test_basic_offsets · Function · 0.64