hub / github.com/openai/tiktoken / encode

Method encode

tiktoken/core.py:82–136

Encodes a string into tokens. Special tokens are artificial tokens used to unlock capabilities from a model, such as fill-in-the-middle. So we want to be careful about accidentally encoding special tokens, since they can be used to trick a model into doing something we don't want it to do.

(
    self,
    text: str,
    *,
    allowed_special: Literal["all"] | AbstractSet[str] = set(),  # noqa: B006
    disallowed_special: Literal["all"] | Collection[str] = "all",
)
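How the two keyword-only parameters interact can be sketched in plain Python. This is a minimal illustration using only the standard library, not the library's code: `SPECIAL_TOKENS` and `check_special` are hypothetical names, the token set is made up, and the stdlib `re` module stands in for the `_special_token_regex` helper, whose internals are not shown on this page.

```python
import re
from typing import AbstractSet, Collection, Literal, Union

# Hypothetical special-token set, for illustration only.
SPECIAL_TOKENS = frozenset({"<|endoftext|>", "<|fim_prefix|>"})


def check_special(
    text: str,
    *,
    allowed_special: Union[Literal["all"], AbstractSet[str]] = frozenset(),
    disallowed_special: Union[Literal["all"], Collection[str]] = "all",
) -> AbstractSet[str]:
    """Return the resolved allowed set, or raise ValueError on a disallowed token."""
    # "all" expands to every known special token.
    if allowed_special == "all":
        allowed_special = SPECIAL_TOKENS
    # By default, everything that is not explicitly allowed is disallowed.
    if disallowed_special == "all":
        disallowed_special = SPECIAL_TOKENS - allowed_special
    # An empty collection such as () skips the check entirely.
    if disallowed_special:
        # Assumption: an alternation of escaped tokens approximates the real helper.
        pattern = re.compile("|".join(re.escape(t) for t in disallowed_special))
        match = pattern.search(text)
        if match:
            raise ValueError(f"disallowed special token: {match.group()!r}")
    return allowed_special
```

With the defaults, any text containing a special token raises. Passing `disallowed_special=()` disables the check so the text would be encoded as ordinary text, and passing `allowed_special="all"` leaves nothing to disallow.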

Source from the content-addressed store, hash-verified

def encode(
    self,
    text: str,
    *,
    allowed_special: Literal["all"] | AbstractSet[str] = set(),  # noqa: B006
    disallowed_special: Literal["all"] | Collection[str] = "all",
) -> list[int]:
    """Encodes a string into tokens.

    Special tokens are artificial tokens used to unlock capabilities from a model,
    such as fill-in-the-middle. So we want to be careful about accidentally encoding special
    tokens, since they can be used to trick a model into doing something we don't want it to do.

    Hence, by default, encode will raise an error if it encounters text that corresponds
    to a special token. This can be controlled on a per-token level using the `allowed_special`
    and `disallowed_special` parameters. In particular:
    - Setting `disallowed_special` to () will prevent this function from raising errors and
      cause all text corresponding to special tokens to be encoded as natural text.
    - Setting `allowed_special` to "all" will cause this function to treat all text
      corresponding to special tokens to be encoded as special tokens.

    ```
    >>> enc.encode("hello world")
    [31373, 995]
    >>> enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})
    [50256]
    >>> enc.encode("<|endoftext|>", allowed_special="all")
    [50256]
    >>> enc.encode("<|endoftext|>")
    # Raises ValueError
    >>> enc.encode("<|endoftext|>", disallowed_special=())
    [27, 91, 437, 1659, 5239, 91, 29]
    ```
    """
    if allowed_special == "all":
        allowed_special = self.special_tokens_set
    if disallowed_special == "all":
        disallowed_special = self.special_tokens_set - allowed_special
    if disallowed_special:
        if not isinstance(disallowed_special, frozenset):
            disallowed_special = frozenset(disallowed_special)
        if match := _special_token_regex(disallowed_special).search(text):
            raise_disallowed_special_token(match.group())

    try:
        return self._core_bpe.encode(text, allowed_special)
    except UnicodeEncodeError:
        # BPE operates on bytes, but the regex operates on unicode. If we pass a str that is
        # invalid UTF-8 to Rust, it will rightfully complain. Here we do a quick and dirty
        # fixup for any surrogate pairs that may have sneaked their way into the text.
        # Technically, this introduces a place where encode + decode doesn't roundtrip a Python
        # string, but given that this is input we want to support, maybe that's okay.
        # Also we use errors="replace" to handle weird things like lone surrogates.
        text = text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")
        return self._core_bpe.encode(text, allowed_special)
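The surrogate fixup in the `except UnicodeEncodeError` branch can be tried in isolation with the standard library alone. The sketch below wraps the same UTF-16 round trip in a helper; `fix_surrogates` is a hypothetical name used only for this illustration.

```python
def fix_surrogates(text: str) -> str:
    """Round-trip through UTF-16 so surrogate pairs merge and lone surrogates are replaced."""
    # "surrogatepass" lets the encoder emit surrogate code units as-is;
    # "replace" turns anything still undecodable into U+FFFD.
    return text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")


# A surrogate pair stored as two separate code points cannot be encoded as UTF-8.
pair = "\ud83d\udc4d"
try:
    pair.encode("utf-8")
except UnicodeEncodeError:
    print("raw surrogates are not valid UTF-8")

# After the fixup, the pair is merged into the single code point U+1F44D.
print(fix_surrogates(pair) == "\U0001F44D")

# A lone surrogate has no partner, so it becomes the replacement character.
print(fix_surrogates("\ud83d") == "\ufffd")
```

As the source comment notes, this is where encode followed by decode stops round-tripping the original Python string: the repaired text is what gets tokenized, not the input with its surrogates.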

Callers 15

test_pickle · Function · 0.95
benchmark_batch · Function · 0.45
test_hyp_offsets · Function · 0.45
test_basic_offsets · Function · 0.45
test_simple · Function · 0.45
test_simple_repeated · Function · 0.45
test_large_repeated · Function · 0.45
test_simple_regex · Function · 0.45
test_basic_encode · Function · 0.45
test_encode_empty · Function · 0.45

Calls 3

_special_token_regex · Function · 0.85
decode · Method · 0.45

Tested by 15

test_pickle · Function · 0.76
test_hyp_offsets · Function · 0.36
test_basic_offsets · Function · 0.36
test_simple · Function · 0.36
test_simple_repeated · Function · 0.36
test_large_repeated · Function · 0.36
test_simple_regex · Function · 0.36
test_basic_encode · Function · 0.36
test_encode_empty · Function · 0.36
test_basic_roundtrip · Function · 0.36