Concatenate several datasets (sources) into a single dataset.

Use axis=0 to concatenate vertically (default), or axis=1 to concatenate horizontally.

Note for iterable datasets:

* if axis=0, the resulting dataset's `num_shards` is the sum of each dataset's `num_shards`.
* if axis=1, the resulting dataset has one (1) shard to not misalign data.
(
dsets: list[DatasetType],
info: Optional[DatasetInfo] = None,
split: Optional[NamedSplit] = None,
axis: int = 0,
)
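
To make the two axes concrete, here is a minimal plain-Python model of the behaviour described above. It is only a sketch, not the library's implementation: `concat_columns` and `concat_num_shards` are hypothetical helper names, a dataset is modelled as a dict mapping column names to equal-length lists, and the checks (identical columns for axis=0; equal row counts and distinct column names for axis=1) are assumptions made for the illustration.

```python
from typing import Any

# Hypothetical stand-in for a dataset: column name -> list of values.
Columns = dict[str, list[Any]]


def concat_columns(dsets: list[Columns], axis: int = 0) -> Columns:
    """Toy model of concatenation over rows (axis=0) or columns (axis=1)."""
    if not dsets:
        raise ValueError("Unable to concatenate an empty list of datasets.")
    if axis == 0:
        # Vertical: stack rows column by column.
        # Assumption of this sketch: same column names in the same order.
        names = list(dsets[0])
        if any(list(d) != names for d in dsets):
            raise ValueError("axis=0 in this sketch requires identical columns.")
        return {name: [v for d in dsets for v in d[name]] for name in names}
    if axis == 1:
        # Horizontal: place columns side by side.
        # Assumption of this sketch: equal row counts and distinct column names.
        row_counts = {len(col) for d in dsets for col in d.values()}
        if len(row_counts) > 1:
            raise ValueError("axis=1 in this sketch requires equal row counts.")
        merged: Columns = {}
        for d in dsets:
            for name, col in d.items():
                if name in merged:
                    raise ValueError(f"Duplicate column name: {name!r}")
                merged[name] = list(col)
        return merged
    raise ValueError("axis must be 0 or 1.")


def concat_num_shards(num_shards: list[int], axis: int = 0) -> int:
    """Shard count of the result for iterable datasets, following the note above."""
    # axis=0 sums the shards; axis=1 collapses to a single shard to keep rows aligned.
    return sum(num_shards) if axis == 0 else 1


ds1 = {"text": ["a", "b"]}
ds2 = {"text": ["c"]}
labels = {"label": [0, 1]}

print(concat_columns([ds1, ds2]))             # {'text': ['a', 'b', 'c']}
print(concat_columns([ds1, labels], axis=1))  # {'text': ['a', 'b'], 'label': [0, 1]}
print(concat_num_shards([4, 2], axis=0))      # 6
print(concat_num_shards([4, 2], axis=1))      # 1
```

With axis=0 the row count grows and the columns stay the same; with axis=1 the row count stays the same and the set of columns grows.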

def concatenate_datasets(
    dsets: list[DatasetType],
    info: Optional[DatasetInfo] = None,
    split: Optional[NamedSplit] = None,
    axis: int = 0,
) -> DatasetType:
    """
    Concatenate several datasets (sources) into a single dataset.

    Use axis=0 to concatenate vertically (default), or axis=1 to concatenate horizontally.

    Note for iterable datasets:

    * if axis=0, the resulting dataset's `num_shards` is the sum of each dataset's `num_shards`.
    * if axis=1, the resulting dataset has one (1) shard to not misalign data.

    Args:
        dsets (`List[datasets.Dataset]` or `List[datasets.IterableDataset]`):
            List of Datasets to concatenate.
        info (`DatasetInfo`, *optional*):
            Dataset information, like description, citation, etc.
        split (`NamedSplit`, *optional*):
            Name of the dataset split.
        axis (`{0, 1}`, defaults to `0`):
            Axis to concatenate over, where `0` means over rows (vertically) and `1` means over columns
            (horizontally).

            <Added version="1.6.0"/>

    Example:

    ```py
    >>> ds3 = concatenate_datasets([ds1, ds2])
    ```
    """

    if not dsets:
        raise ValueError("Unable to concatenate an empty list of datasets.")
    for i, dataset in enumerate(dsets):
        if not isinstance(dataset, (Dataset, IterableDataset)):
            if isinstance(dataset, (DatasetDict, IterableDatasetDict)):
                if not dataset:
                    raise ValueError(
                        f"Expected a list of Dataset objects or a list of IterableDataset objects, but element at position {i} "
                        "is an empty dataset dictionary."
                    )
                raise ValueError(
                    f"Dataset at position {i} has at least one split: {list(dataset)}\n"
                    f"Please pick one to interleave with the other datasets, for example: dataset['{next(iter(dataset))}']"
                )
            raise ValueError(
                f"Expected a list of Dataset objects or a list of IterableDataset objects, but element at position {i} is a {type(dataset).__name__}."
            )
        if i == 0:
            dataset_type, other_type = (
                (Dataset, IterableDataset) if isinstance(dataset, Dataset) else (IterableDataset, Dataset)
            )
        elif not isinstance(dataset, dataset_type):