hub / github.com/huggingface/datasets / concatenate_datasets

Function concatenate_datasets

src/datasets/combine.py:168–232

Concatenate several datasets (sources) into a single dataset. Use axis=0 to concatenate vertically (default), or axis=1 to concatenate horizontally. For iterable datasets, axis=0 yields a dataset whose `num_shards` is the sum of each dataset's `num_shards`, while axis=1 yields a single shard so that the data stays aligned.

(
    dsets: list[DatasetType],
    info: Optional[DatasetInfo] = None,
    split: Optional[NamedSplit] = None,
    axis: int = 0,
)

Source from the content-addressed store, hash-verified

def concatenate_datasets(
    dsets: list[DatasetType],
    info: Optional[DatasetInfo] = None,
    split: Optional[NamedSplit] = None,
    axis: int = 0,
) -> DatasetType:
    """
    Concatenate several datasets (sources) into a single dataset.

    Use axis=0 to concatenate vertically (default), or axis=1 to concatenate horizontally.

    Note for iterable datasets:

    * if axis=0, the resulting dataset's `num_shards` is the sum of each dataset's `num_shards`.
    * if axis=1, the resulting dataset has one (1) shard to not misalign data.

    Args:
        dsets (`List[datasets.Dataset]` or `List[datasets.IterableDataset]`):
            List of Datasets to concatenate.
        info (`DatasetInfo`, *optional*):
            Dataset information, like description, citation, etc.
        split (`NamedSplit`, *optional*):
            Name of the dataset split.
        axis (`{0, 1}`, defaults to `0`):
            Axis to concatenate over, where `0` means over rows (vertically) and `1` means over columns
            (horizontally).

            <Added version="1.6.0"/>

    Example:

    ```py
    >>> ds3 = concatenate_datasets([ds1, ds2])
    ```
    """

    if not dsets:
        raise ValueError("Unable to concatenate an empty list of datasets.")
    for i, dataset in enumerate(dsets):
        if not isinstance(dataset, (Dataset, IterableDataset)):
            if isinstance(dataset, (DatasetDict, IterableDatasetDict)):
                if not dataset:
                    raise ValueError(
                        f"Expected a list of Dataset objects or a list of IterableDataset objects, but element at position {i} "
                        "is an empty dataset dictionary."
                    )
                raise ValueError(
                    f"Dataset at position {i} has at least one split: {list(dataset)}\n"
                    f"Please pick one to interleave with the other datasets, for example: dataset['{next(iter(dataset))}']"
                )
            raise ValueError(
                f"Expected a list of Dataset objects or a list of IterableDataset objects, but element at position {i} is a {type(dataset).__name__}."
            )
        if i == 0:
            dataset_type, other_type = (
                (Dataset, IterableDataset) if isinstance(dataset, Dataset) else (IterableDataset, Dataset)
            )
        elif not isinstance(dataset, dataset_type):