Materialize and split the dataset using proportions. A common use case for this is splitting the dataset into train and test sets (equivalent to e.g. scikit-learn's ``train_test_split``). For a higher-level abstraction, see :meth:`Dataset.train_test_split`. This meth
(
self, proportions: List[float]
)
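The docstring for this method describes the result as equivalent to calculating split indices by hand and calling :meth:`Dataset.split_at_indices`. The following is a minimal pure-Python sketch of that calculation, assuming each cut point is the floor of the cumulative proportion times the row count. The helpers `proportions_to_indices` and `split_at` are hypothetical names used for illustration and are not part of Ray. The right-to-left adjustment that keeps every split non-empty is one plausible strategy, not a confirmed description of the library's logic.

```python
from itertools import accumulate
from typing import List, Sequence


def proportions_to_indices(num_rows: int, proportions: List[float]) -> List[int]:
    """Hypothetical helper: turn proportions into cut points for a split."""
    # Same argument checks as the method's documented contract.
    if len(proportions) < 1:
        raise ValueError("proportions must be at least of length 1")
    if sum(proportions) >= 1:
        raise ValueError("proportions must sum to less than 1")
    if any(p <= 0 for p in proportions):
        raise ValueError("proportions must be bigger than 0")

    # Cut point i is the floor of (cumulative proportion * row count).
    indices = [int(num_rows * c) for c in accumulate(proportions)]

    # Assumption: walk right to left and pull each cut point down so that
    # it is strictly below the next one, and the last cut point leaves at
    # least one row for the remainder split.
    upper = num_rows
    for i in range(len(indices) - 1, -1, -1):
        indices[i] = min(indices[i], upper - 1)
        upper = indices[i]

    # If the first cut point was pushed to 0 or below, the first split
    # would be empty, so the request cannot be satisfied by this sketch.
    if indices[0] <= 0:
        raise ValueError("could not give every split at least one row")
    return indices


def split_at(rows: Sequence[int], indices: List[int]) -> List[List[int]]:
    """Hypothetical stand-in for splitting a materialized dataset at indices."""
    bounds = [0, *indices, len(rows)]
    return [list(rows[a:b]) for a, b in zip(bounds, bounds[1:])]


cuts = proportions_to_indices(10, [0.2, 0.5])
print(cuts)                             # [2, 7]
print(split_at(list(range(10)), cuts))  # [[0, 1], [2, 3, 4, 5, 6], [7, 8, 9]]
```

With 10 rows and proportions `[0.2, 0.5]`, the sketch produces cut points `[2, 7]`, which matches the three splits shown in the docstring example: two proportions yield three splits, and the last one holds the remaining rows.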
| 2531 | @ConsumptionAPI |
| 2532 | @PublicAPI(api_group=SMJ_API_GROUP) |
| 2533 | def split_proportionately( |
| 2534 | self, proportions: List[float] |
| 2535 | ) -> List["MaterializedDataset"]: |
| 2536 | """Materialize and split the dataset using proportions. |
| 2537 | |
| 2538 | A common use case for this is splitting the dataset into train |
| 2539 | and test sets (equivalent to e.g. scikit-learn's ``train_test_split``). |
| 2540 | For a higher-level abstraction, see :meth:`Dataset.train_test_split`. |
| 2541 | |
| 2542 | This method splits the dataset so that every split |
| 2543 | contains at least one row. If that isn't possible, |
| 2544 | an exception is raised. |
| 2545 | |
| 2546 | This is equivalent to calculating the indices manually and calling |
| 2547 | :meth:`Dataset.split_at_indices`. |
| 2548 | |
| 2549 | Examples: |
| 2550 | >>> import ray |
| 2551 | >>> ds = ray.data.range(10) |
| 2552 | >>> d1, d2, d3 = ds.split_proportionately([0.2, 0.5]) |
| 2553 | >>> d1.take_batch() |
| 2554 | {'id': array([0, 1])} |
| 2555 | >>> d2.take_batch() |
| 2556 | {'id': array([2, 3, 4, 5, 6])} |
| 2557 | >>> d3.take_batch() |
| 2558 | {'id': array([7, 8, 9])} |
| 2559 | |
| 2560 | Time complexity: O(num splits) |
| 2561 | |
| 2562 | Args: |
| 2563 | proportions: List of proportions to split the dataset according to. |
| 2564 | Must sum to less than 1, and each proportion must be greater |
| 2565 | than 0. The remaining rows form the final split. |
| 2566 | |
| 2567 | Returns: |
| 2568 | The dataset splits. |
| 2569 | |
| 2570 | .. seealso:: |
| 2571 | |
| 2572 | :meth:`Dataset.split` |
| 2573 | Unlike :meth:`~Dataset.split_proportionately`, which lets you split a |
| 2574 | dataset into different sizes, :meth:`Dataset.split` splits a dataset |
| 2575 | into approximately equal splits. |
| 2576 | |
| 2577 | :meth:`Dataset.split_at_indices` |
| 2578 | :meth:`Dataset.split_proportionately` uses this method under the hood. |
| 2579 | |
| 2580 | :meth:`Dataset.streaming_split` |
| 2581 | Unlike :meth:`~Dataset.split`, :meth:`~Dataset.streaming_split` |
| 2582 | doesn't materialize the dataset in memory. |
| 2583 | """ |
| 2584 | |
| 2585 | if len(proportions) < 1: |
| 2586 | raise ValueError("proportions must be at least of length 1") |
| 2587 | if sum(proportions) >= 1: |
| 2588 | raise ValueError("proportions must sum to less than 1") |
| 2589 | if any(p <= 0 for p in proportions): |
| 2590 | raise ValueError("proportions must be bigger than 0") |