hub / github.com/openai/tiktoken / encode

Method encode

tiktoken/core.py:82–136

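Encodes a string into tokens. Special tokens are artificial tokens used to unlock capabilities from a model, such as fill-in-the-middle. Accidentally encoding them is risky, since they can be used to trick a model into doing something we don't want it to do. By default, encode therefore raises an error when the text contains something that corresponds to a special token; the allowed_special and disallowed_special parameters control this on a per-token level.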

(
    self,
    text: str,
    *,
    allowed_special: Literal["all"] | AbstractSet[str] = set(),  # noqa: B006
    disallowed_special: Literal["all"] | Collection[str] = "all",
)
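
How the two keyword-only parameters interact can be sketched without tiktoken installed. The helper below is a hypothetical stand-in, not part of the library: `find_disallowed_special` and the `SPECIAL` token set are names invented for illustration, and the standard-library `re` module is assumed to be an adequate substitute for the library's private `_special_token_regex` helper. Where the real method raises an error on a match, this sketch simply returns the offending token.

```python
from __future__ import annotations

import re
from typing import AbstractSet, Collection, Literal


def find_disallowed_special(
    text: str,
    special_tokens: AbstractSet[str],
    allowed_special: Literal["all"] | AbstractSet[str] = frozenset(),
    disallowed_special: Literal["all"] | Collection[str] = "all",
) -> str | None:
    """Return the first disallowed special token found in text, or None."""
    # "all" expands to every special token the encoding knows about.
    if allowed_special == "all":
        allowed_special = special_tokens
    # By default, every special token that was not explicitly allowed is disallowed.
    if disallowed_special == "all":
        disallowed_special = set(special_tokens) - set(allowed_special)
    # An empty collection, e.g. disallowed_special=(), switches the check off.
    if not disallowed_special:
        return None
    # Assumption: a plain alternation of escaped literals is enough here.
    pattern = re.compile("|".join(re.escape(tok) for tok in sorted(disallowed_special)))
    match = pattern.search(text)
    return match.group() if match else None


# Hypothetical token set; a real encoding supplies its own.
SPECIAL = frozenset({"<|endoftext|>", "<|fim_prefix|>"})

print(find_disallowed_special("hello world", SPECIAL))
print(find_disallowed_special("a<|endoftext|>b", SPECIAL))
print(find_disallowed_special("a<|endoftext|>b", SPECIAL, allowed_special="all"))
print(find_disallowed_special("a<|endoftext|>b", SPECIAL, disallowed_special=()))
```

Only the second call reports a token: with the defaults, nothing is allowed and everything is disallowed. Allowing a token removes it from the default disallowed set, and passing an empty `disallowed_special` skips the scan entirely, which in the real method means the text is encoded as ordinary text.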

Source from the content-addressed store, hash-verified

    def encode(
        self,
        text: str,
        *,
        allowed_special: Literal["all"] | AbstractSet[str] = set(),  # noqa: B006
        disallowed_special: Literal["all"] | Collection[str] = "all",
    ) -> list[int]:
        """Encodes a string into tokens.

        Special tokens are artificial tokens used to unlock capabilities from a model,
        such as fill-in-the-middle. So we want to be careful about accidentally encoding special
        tokens, since they can be used to trick a model into doing something we don't want it to do.

        Hence, by default, encode will raise an error if it encounters text that corresponds
        to a special token. This can be controlled on a per-token level using the `allowed_special`
        and `disallowed_special` parameters. In particular:
        - Setting `disallowed_special` to () will prevent this function from raising errors and
          cause all text corresponding to special tokens to be encoded as natural text.
        - Setting `allowed_special` to "all" will cause this function to treat all text
          corresponding to special tokens to be encoded as special tokens.

        ```
        >>> enc.encode("hello world")
        [31373, 995]
        >>> enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})
        [50256]
        >>> enc.encode("<|endoftext|>", allowed_special="all")
        [50256]
        >>> enc.encode("<|endoftext|>")
        # Raises ValueError
        >>> enc.encode("<|endoftext|>", disallowed_special=())
        [27, 91, 437, 1659, 5239, 91, 29]
        ```
        """
        if allowed_special == "all":
            allowed_special = self.special_tokens_set
        if disallowed_special == "all":
            disallowed_special = self.special_tokens_set - allowed_special
        if disallowed_special:
            if not isinstance(disallowed_special, frozenset):
                disallowed_special = frozenset(disallowed_special)
            if match := _special_token_regex(disallowed_special).search(text):
                raise_disallowed_special_token(match.group())

        try:
            return self._core_bpe.encode(text, allowed_special)
        except UnicodeEncodeError:
            # BPE operates on bytes, but the regex operates on unicode. If we pass a str that is
            # invalid UTF-8 to Rust, it will rightfully complain. Here we do a quick and dirty
            # fixup for any surrogate pairs that may have sneaked their way into the text.
            # Technically, this introduces a place where encode + decode doesn't roundtrip a Python
            # string, but given that this is input we want to support, maybe that's okay.
            # Also we use errors="replace" to handle weird things like lone surrogates.
            text = text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")
            return self._core_bpe.encode(text, allowed_special)
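
The fallback in the `except UnicodeEncodeError` branch can be tried in isolation with the standard library alone. The sketch below wraps the same UTF-16 round trip in a helper; the name `fix_surrogates` is made up for this example and does not exist in tiktoken.

```python
def fix_surrogates(text: str) -> str:
    """Apply the same UTF-16 round trip that encode falls back to."""
    # "surrogatepass" lets surrogate code points through on the way out;
    # "replace" turns anything still malformed into U+FFFD on the way back.
    return text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")


pair = "\ud83d\udc4d"  # two surrogate code points that together mean U+1F44D
lone = "\ud83d"        # an unpaired high surrogate

# Neither string can be encoded as UTF-8 directly, which is the failure
# the except branch in encode is reacting to.
for s in (pair, lone):
    try:
        s.encode("utf-8")
    except UnicodeEncodeError:
        print("not encodable as UTF-8:", ascii(s))

print(ascii(fix_surrogates(pair)))   # the pair is merged into one code point
print(ascii(fix_surrogates(lone)))   # the lone surrogate becomes U+FFFD
print(fix_surrogates("hello"))       # well-formed text passes through unchanged
```

Because the repaired string differs from the input, decoding the resulting tokens cannot reproduce the original Python string. This is the round-trip caveat the source comment mentions.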

Callers 15

test_pickle · Function · 0.95
benchmark_batch · Function · 0.45
test_hyp_offsets · Function · 0.45
test_basic_offsets · Function · 0.45
test_simple · Function · 0.45
test_simple_repeated · Function · 0.45
test_large_repeated · Function · 0.45
test_simple_regex · Function · 0.45
test_basic_encode · Function · 0.45
test_encode_empty · Function · 0.45
test_encode_surrogate_pairs · Function · 0.45
test_catastrophically_repetitive · Function · 0.45

Calls 3

_special_token_regex · Function · 0.85
raise_disallowed_special_token · Function · 0.85
decode · Method · 0.45

Tested by 15

test_pickle · Function · 0.76
test_hyp_offsets · Function · 0.36
test_basic_offsets · Function · 0.36
test_simple · Function · 0.36
test_simple_repeated · Function · 0.36
test_large_repeated · Function · 0.36
test_simple_regex · Function · 0.36
test_basic_encode · Function · 0.36
test_encode_empty · Function · 0.36
test_encode_surrogate_pairs · Function · 0.36
test_catastrophically_repetitive · Function · 0.36
test_basic_roundtrip · Function · 0.36