hub / github.com/NVIDIA-NeMo/RL / get_tokenizer

Function get_tokenizer

nemo_rl/algorithms/utils.py:233–403

Get the tokenizer and set pad token to eos token if it is not already set. This function initializes a tokenizer from the Hugging Face transformers library and configures it with appropriate chat templates and padding tokens.

(
    tokenizer_config: TokenizerConfig, get_processor: bool = False
)
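The padding behavior named in the summary (use the EOS token when no pad token is set) can be sketched in isolation. This is a minimal illustration, not the body of `get_tokenizer`: `StubTokenizer` and `ensure_pad_token` are hypothetical names, and the stub only mimics the `pad_token` and `eos_token` attributes that Hugging Face tokenizers expose.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubTokenizer:
    """Hypothetical stand-in with only the two attributes this sketch touches."""

    eos_token: str = "</s>"
    pad_token: Optional[str] = None


def ensure_pad_token(tokenizer):
    """Reuse the EOS token for padding when no pad token is defined."""
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer


# A tokenizer without a pad token inherits its EOS token.
filled = ensure_pad_token(StubTokenizer())
# A tokenizer that already defines a pad token is left unchanged.
kept = ensure_pad_token(StubTokenizer(pad_token="<pad>"))
```

The fallback only fires when the pad token is missing, so models that ship a dedicated pad token keep it.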

Source from the content-addressed store, hash-verified

def get_tokenizer(
    tokenizer_config: TokenizerConfig, get_processor: bool = False
) -> PreTrainedTokenizerBase:
    """Get the tokenizer and set pad token to eos token if it is not already set.

    This function initializes a tokenizer from the Hugging Face transformers library
    and configures it with appropriate chat templates and padding tokens.

    Args:
        tokenizer_config: A dictionary containing tokenizer configuration.
            Required keys:
                - name: The name or path of the pretrained tokenizer
            Optional keys:
                - chat_template: The chat template to use. Can be:
                    - None: Uses a passthrough template that just returns message content
                    - "default": Uses the tokenizer's default template
                    - A custom jinja2 template string
                    If not specified, the tokenizer's default template will be used.
        get_processor: Whether to return a processor (via AutoProcessor) instead of a tokenizer.

    Returns:
        PreTrainedTokenizerBase: The configured tokenizer instance

    Examples:
    ```{doctest}
    >>> from transformers import AutoTokenizer
    >>> from nemo_rl.algorithms.utils import get_tokenizer
    >>> # not specifying a chat template uses the tokenizer's default
    >>> config = {"name": "meta-llama/Llama-3.2-1B-Instruct"}
    >>> tokenizer = get_tokenizer(config)
    No chat template provided, using tokenizer's default
    >>> messages = [
    ...     {"role": "system", "content": "You are a helpful AI assistant."},
    ...     {"role": "user", "content": "Hello!"}
    ... ]
    >>> formatted = tokenizer.apply_chat_template(messages, tokenize=False)
    >>> assert formatted == AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct").apply_chat_template(messages, tokenize=False)

    >>> # Using a passthrough template
    >>> config = {
    ...     "name": "meta-llama/Llama-3.2-1B-Instruct",
    ...     "chat_template": None
    ... }
    >>> tokenizer = get_tokenizer(config)
    Using passthrough chat template
    >>> formatted = tokenizer.apply_chat_template(messages, tokenize=False)
    >>> assert formatted == "".join(msg["content"] for msg in messages)

    >>> # Using a custom template
    >>> config = {
    ...     "name": "meta-llama/Llama-3.2-1B-Instruct",
    ...     "chat_template": "{% for message in messages %}{{ ' START: ' + message['content'] + ' END.' }}{% endfor %}"
    ... }
    >>> tokenizer = get_tokenizer(config)
    Using custom chat template
    >>> formatted = tokenizer.apply_chat_template(messages, tokenize=False)
    >>> assert formatted == " START: You are a helpful AI assistant. END. START: Hello! END."
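The docstring distinguishes an absent `chat_template` key (tokenizer default) from a key explicitly set to `None` (passthrough). The sketch below shows one way that four-way choice could be resolved. It is an illustration under assumptions, not the function's actual body: `resolve_chat_template` and `PASSTHROUGH_TEMPLATE` are hypothetical names, and the passthrough template string is an assumed Jinja2 template that simply concatenates message contents.

```python
from typing import Any, Mapping, Optional

# Assumed passthrough template: emits each message's content with no role markers.
PASSTHROUGH_TEMPLATE = (
    "{% for message in messages %}{{ message['content'] }}{% endfor %}"
)


def resolve_chat_template(
    config: Mapping[str, Any], tokenizer_default: Optional[str]
) -> Optional[str]:
    """Pick the template string implied by the config (hypothetical helper)."""
    # Key absent: keep whatever template the tokenizer ships with.
    if "chat_template" not in config:
        return tokenizer_default
    choice = config["chat_template"]
    # Key present but None: passthrough, message content only.
    if choice is None:
        return PASSTHROUGH_TEMPLATE
    # The literal string "default" also selects the tokenizer's own template.
    if choice == "default":
        return tokenizer_default
    # Anything else is treated as a custom Jinja2 template string.
    return choice


shipped = "<tokenizer default template>"  # placeholder for the model's template
absent = resolve_chat_template({"name": "some-model"}, shipped)
explicit_none = resolve_chat_template({"name": "some-model", "chat_template": None}, shipped)
named_default = resolve_chat_template({"name": "some-model", "chat_template": "default"}, shipped)
custom = resolve_chat_template({"name": "some-model", "chat_template": "{{ messages }}"}, shipped)
```

Checking `"chat_template" in config` rather than `config.get("chat_template")` is what keeps the absent-key and explicit-`None` cases apart.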

Callers 15

main · Function · 0.90
__init__ · Method · 0.90
tokenizer · Function · 0.90
create_dataloader · Function · 0.90
test_math_data_processor · Function · 0.90
tokenizer · Function · 0.90
tokenizer · Function · 0.90
policy_setup · Function · 0.90
training_setup · Function · 0.90

Calls

no outgoing calls

Tested by 15

tokenizer · Function · 0.72
create_dataloader · Function · 0.72
test_math_data_processor · Function · 0.72
tokenizer · Function · 0.72
tokenizer · Function · 0.72
policy_setup · Function · 0.72
training_setup · Function · 0.72
generation_setup · Function · 0.72
logprob_setup · Function · 0.72