Function get_tokenizer

github.com/NVIDIA-NeMo/RL · nemo_rl/algorithms/utils.py:233–403

Get the tokenizer and set the pad token to the EOS token if it is not already set. This function initializes a tokenizer from the Hugging Face transformers library and configures it with appropriate chat templates and padding tokens.
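The pad-token fallback mentioned in the summary can be sketched as follows. This is a minimal illustration, not the function's actual body, which is not shown in this excerpt. The helper name `ensure_pad_token` is hypothetical, and the stand-in objects only mimic the `pad_token` and `eos_token` attributes that Hugging Face tokenizers expose.

```python
from types import SimpleNamespace


def ensure_pad_token(tokenizer):
    """Hypothetical helper: reuse the EOS token when no pad token is set."""
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer


# Stand-ins for real tokenizers; only the two attributes used above exist.
without_pad = ensure_pad_token(SimpleNamespace(pad_token=None, eos_token="</s>"))
with_pad = ensure_pad_token(SimpleNamespace(pad_token="<pad>", eos_token="</s>"))
```

In this sketch a tokenizer that already defines a pad token is left unchanged, so the fallback only affects models that ship without one.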

(
    tokenizer_config: TokenizerConfig, get_processor: bool = False
)
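The docstring distinguishes three values for the optional `chat_template` key, plus the case where the key is absent. The selection logic might look roughly like the sketch below. It is an assumption about the behaviour described in the docstring, not the repository's code: `select_chat_template`, `PASSTHROUGH_TEMPLATE`, and the mode labels are hypothetical names, and the passthrough template string is only one plausible way to return message content unchanged.

```python
# Hypothetical passthrough template: emits each message's content as-is.
PASSTHROUGH_TEMPLATE = (
    "{% for message in messages %}{{ message['content'] }}{% endfor %}"
)


def select_chat_template(tokenizer_config, default_template):
    """Return (template, mode) for the documented chat_template cases."""
    # Key missing entirely: keep the tokenizer's own template.
    if "chat_template" not in tokenizer_config:
        return default_template, "default"
    requested = tokenizer_config["chat_template"]
    # Explicit None: passthrough that just concatenates message contents.
    if requested is None:
        return PASSTHROUGH_TEMPLATE, "passthrough"
    # The literal string "default": also the tokenizer's own template.
    if requested == "default":
        return default_template, "default"
    # Anything else is treated as a custom jinja2 template string.
    return requested, "custom"


model_default = "<tokenizer default template>"  # placeholder, not a real template
absent = select_chat_template({"name": "some-model"}, model_default)
passthrough = select_chat_template({"name": "some-model", "chat_template": None}, model_default)
custom = select_chat_template({"name": "some-model", "chat_template": "{{ messages }}"}, model_default)
```

Note that an absent key and an explicit `None` are handled differently, which is why the sketch checks for key membership instead of calling `tokenizer_config.get("chat_template")`.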

Source

def get_tokenizer(
    tokenizer_config: TokenizerConfig, get_processor: bool = False
) -> PreTrainedTokenizerBase:
    """Get the tokenizer and set pad token to eos token if it is not already set.

    This function initializes a tokenizer from the Hugging Face transformers library
    and configures it with appropriate chat templates and padding tokens.

    Args:
        tokenizer_config: A dictionary containing tokenizer configuration.
            Required keys:
                - name: The name or path of the pretrained tokenizer
            Optional keys:
                - chat_template: The chat template to use. Can be:
                    - None: Uses a passthrough template that just returns message content
                    - "default": Uses the tokenizer's default template
                    - A custom jinja2 template string
                    If not specified, the tokenizer's default template will be used.
        get_processor: Whether to return a processor (via AutoProcessor) instead of a tokenizer.

    Returns:
        PreTrainedTokenizerBase: The configured tokenizer instance

    Examples:
    ```{doctest}
    >>> from transformers import AutoTokenizer
    >>> from nemo_rl.algorithms.utils import get_tokenizer
    >>> # not specifying a chat template uses the tokenizer's default
    >>> config = {"name": "meta-llama/Llama-3.2-1B-Instruct"}
    >>> tokenizer = get_tokenizer(config)
    No chat template provided, using tokenizer's default
    >>> messages = [
    ...     {"role": "system", "content": "You are a helpful AI assistant."},
    ...     {"role": "user", "content": "Hello!"}
    ... ]
    >>> formatted = tokenizer.apply_chat_template(messages, tokenize=False)
    >>> assert formatted == AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct").apply_chat_template(messages, tokenize=False)

    >>> # Using a passthrough template
    >>> config = {
    ...     "name": "meta-llama/Llama-3.2-1B-Instruct",
    ...     "chat_template": None
    ... }
    >>> tokenizer = get_tokenizer(config)
    Using passthrough chat template
    >>> formatted = tokenizer.apply_chat_template(messages, tokenize=False)
    >>> assert formatted == "".join(msg["content"] for msg in messages)

    >>> # Using a custom template
    >>> config = {
    ...     "name": "meta-llama/Llama-3.2-1B-Instruct",
    ...     "chat_template": "{% for message in messages %}{{ ' START: ' + message['content'] + ' END.' }}{% endfor %}"
    ... }
    >>> tokenizer = get_tokenizer(config)
    Using custom chat template
    >>> formatted = tokenizer.apply_chat_template(messages, tokenize=False)
    >>> assert formatted == " START: You are a helpful AI assistant. END. START: Hello! END."

Callers 15

main · Function · 0.90
__init__ · Method · 0.90
tokenizer · Function · 0.90
create_dataloader · Function · 0.90
test_get_formatted_message_log_qwen3_enable_thinking · Function · 0.90
test_math_data_processor · Function · 0.90
test_math_hf_data_processor · Function · 0.90
test_eval_math_hf_data_processor · Function · 0.90
tokenizer · Function · 0.90
policy_setup · Function · 0.90
training_setup · Function · 0.90

Calls

no outgoing calls

Tested by 15

tokenizer · Function · 0.72
create_dataloader · Function · 0.72
test_get_formatted_message_log_qwen3_enable_thinking · Function · 0.72
test_math_data_processor · Function · 0.72
test_math_hf_data_processor · Function · 0.72
test_eval_math_hf_data_processor · Function · 0.72
tokenizer · Function · 0.72
policy_setup · Function · 0.72
training_setup · Function · 0.72
generation_setup · Function · 0.72
logprob_setup · Function · 0.72