Get the tokenizer and set the pad token to the EOS token if it is not already set. This function initializes a tokenizer from the Hugging Face transformers library and configures it with appropriate chat templates and padding tokens.
(
tokenizer_config: TokenizerConfig, get_processor: bool = False
)
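The pad-token fallback named in the summary can be sketched as follows. This is a minimal illustration, not the function's actual body: `ensure_pad_token` is a hypothetical helper, and the `SimpleNamespace` object is a stand-in for a real tokenizer, which exposes the same `pad_token` and `eos_token` attributes.

```python
from types import SimpleNamespace


def ensure_pad_token(tokenizer):
    """Hypothetical helper: reuse the EOS token for padding when no pad token is set."""
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer


# Stand-in objects instead of a downloaded tokenizer (assumption for illustration).
no_pad = ensure_pad_token(SimpleNamespace(pad_token=None, eos_token="</s>"))
has_pad = ensure_pad_token(SimpleNamespace(pad_token="<pad>", eos_token="</s>"))

print(no_pad.pad_token)   # </s>
print(has_pad.pad_token)  # <pad>
```

An existing pad token is left untouched, so only tokenizers that ship without one are changed.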
def get_tokenizer(
    tokenizer_config: TokenizerConfig, get_processor: bool = False
) -> PreTrainedTokenizerBase:
    """Get the tokenizer and set pad token to eos token if it is not already set.

    This function initializes a tokenizer from the Hugging Face transformers library
    and configures it with appropriate chat templates and padding tokens.

    Args:
        tokenizer_config: A dictionary containing tokenizer configuration.
            Required keys:
                - name: The name or path of the pretrained tokenizer
            Optional keys:
                - chat_template: The chat template to use. Can be:
                    - None: Uses a passthrough template that just returns message content
                    - "default": Uses the tokenizer's default template
                    - A custom jinja2 template string
                    If not specified, the tokenizer's default template will be used.
        get_processor: Whether to return a processor (via AutoProcessor) instead of a tokenizer.

    Returns:
        PreTrainedTokenizerBase: The configured tokenizer instance

    Examples:
    ```{doctest}
    >>> from transformers import AutoTokenizer
    >>> from nemo_rl.algorithms.utils import get_tokenizer
    >>> # not specifying a chat template uses the tokenizer's default
    >>> config = {"name": "meta-llama/Llama-3.2-1B-Instruct"}
    >>> tokenizer = get_tokenizer(config)
    No chat template provided, using tokenizer's default
    >>> messages = [
    ...     {"role": "system", "content": "You are a helpful AI assistant."},
    ...     {"role": "user", "content": "Hello!"}
    ... ]
    >>> formatted = tokenizer.apply_chat_template(messages, tokenize=False)
    >>> assert formatted == AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct").apply_chat_template(messages, tokenize=False)

    >>> # Using a passthrough template
    >>> config = {
    ...     "name": "meta-llama/Llama-3.2-1B-Instruct",
    ...     "chat_template": None
    ... }
    >>> tokenizer = get_tokenizer(config)
    Using passthrough chat template
    >>> formatted = tokenizer.apply_chat_template(messages, tokenize=False)
    >>> assert formatted == "".join(msg["content"] for msg in messages)

    >>> # Using a custom template
    >>> config = {
    ...     "name": "meta-llama/Llama-3.2-1B-Instruct",
    ...     "chat_template": "{% for message in messages %}{{ ' START: ' + message['content'] + ' END.' }}{% endfor %}"
    ... }
    >>> tokenizer = get_tokenizer(config)
    Using custom chat template
    >>> formatted = tokenizer.apply_chat_template(messages, tokenize=False)
    >>> assert formatted == " START: You are a helpful AI assistant. END. START: Hello! END."
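The docstring distinguishes a missing `chat_template` key (tokenizer default) from an explicit `None` (passthrough). The function body is not shown here, so the following is only a sketch of how that selection might be written. `select_chat_template`, `PASSTHROUGH_TEMPLATE`, and the returned mode labels are hypothetical names introduced for illustration, and the passthrough template string is an assumption chosen to match the documented behavior of concatenating message contents.

```python
from typing import Optional, Tuple

# Assumed passthrough template: emits each message's content with nothing added.
PASSTHROUGH_TEMPLATE = (
    "{% for message in messages %}{{ message['content'] }}{% endfor %}"
)


def select_chat_template(
    tokenizer_config: dict, default_template: Optional[str]
) -> Tuple[Optional[str], str]:
    """Hypothetical helper: return (template, mode) for a tokenizer config.

    The key point is that an absent key and an explicit None are different cases.
    """
    if "chat_template" not in tokenizer_config:
        # Key not specified: keep whatever template the tokenizer ships with.
        return default_template, "default"
    chat_template = tokenizer_config["chat_template"]
    if chat_template is None:
        # Explicit None: passthrough of message content.
        return PASSTHROUGH_TEMPLATE, "passthrough"
    if chat_template == "default":
        return default_template, "default"
    # Anything else is treated as a custom jinja2 template string.
    return chat_template, "custom"


shipped = "<tokenizer's own template>"  # placeholder for the tokenizer default
print(select_chat_template({"name": "m"}, shipped)[1])                           # default
print(select_chat_template({"name": "m", "chat_template": None}, shipped)[1])    # passthrough
print(select_chat_template({"name": "m", "chat_template": "{{ x }}"}, shipped))  # ('{{ x }}', 'custom')
```

Checking key membership before reading the value is what keeps the "not specified" and `None` cases apart; a plain `tokenizer_config.get("chat_template")` would merge them.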