MCPcopy Index your code
hub / github.com/NVIDIA-NeMo/RL / AllTaskProcessedDataset

Class AllTaskProcessedDataset

nemo_rl/data/datasets/processed_dataset.py:33–147  ·  view source on GitHub ↗

Dataset for processing single or multi-task data with task-specific tokenization and processing. Args: dataset: Input dataset containing raw data tokenizer: Tokenizer for text processing default_task_data_spec: Default task processing specifications. In the c

Source from the content-addressed store, hash-verified

31
32# TODO @sahilj handle too-long prompts and masking them out throughout the whole process and renormalizing on loss
33class AllTaskProcessedDataset:
34 """Dataset for processing single or multi-task data with task-specific tokenization and processing.
35
36 Args:
37 dataset: Input dataset containing raw data
38 tokenizer: Tokenizer for text processing
39 default_task_data_spec: Default task processing specifications.
40 In the case of single-task, this is the spec used for processing all entries.
41 In the case of multi-task, any values not specified in the task-specific specs will be taken from the default spec.
42 task_data_processors: Either a single TaskDataProcessFnCallable for single-task,
43 or a dict mapping task names to (TaskDataSpec, TaskDataProcessFnCallable) for multi-task
44 max_seq_length: Maximum sequence length for tokenized outputs
45 """
46
47 def __init__(
48 self,
49 dataset: Dataset | Any,
50 tokenizer: TokenizerType,
51 default_task_data_spec: TaskDataSpec,
52 task_data_processors: (
53 dict[str, tuple[TaskDataSpec, TaskDataProcessFnCallable]]
54 | TaskDataProcessFnCallable
55 ),
56 task_data_preprocessors: Optional[
57 Union[dict[str, TaskDataPreProcessFnCallable], TaskDataPreProcessFnCallable]
58 ] = None,
59 max_seq_length: Optional[int] = None,
60 ):
61 self.dataset = dataset
62 self.tokenizer = tokenizer
63 # TODO @yukih: will be removed once eval datasets are adapted
64 self.default_task_data_spec = default_task_data_spec
65 self.task_data_processors = task_data_processors
66 self.task_data_preprocessors = task_data_preprocessors
67 self.max_seq_length = max_seq_length
68 self._bos_checked = False
69
70 if (
71 isinstance(task_data_processors, dict)
72 and default_task_data_spec is not None
73 ):
74 # apply defaults to all task data specs
75 for _, (task_data_spec, _) in task_data_processors.items():
76 task_data_spec.copy_defaults(self.default_task_data_spec)
77
78 def __len__(self) -> int:
79 return len(self.dataset)
80
81 def encode_single(
82 self, text: Union[str, list[str]]
83 ) -> tuple[list[int] | torch.Tensor, int]:
84 """Takes either a single string or a list of strings that represent multiple turns for the same conversation.
85
86 Returns a single (concatenated) list of tokenized ids and the length of the tokenized ids.
87 """
88 if isinstance(text, str):
89 text_ids = self.tokenizer.text_to_ids(text)
90 return text_ids, len(text_ids)

Callers 8

setup_response_dataFunction · 0.90
setup_preference_dataFunction · 0.90
create_dataloaderFunction · 0.90
test_math_data_processorFunction · 0.90
setup_dataFunction · 0.90
setup_dataFunction · 0.90

Calls

no outgoing calls

Tested by 4

create_dataloaderFunction · 0.72
test_math_data_processorFunction · 0.72