hub / github.com/ray-project/ray / split_proportionately

Method split_proportionately

python/ray/data/dataset.py:2533–2610

Materialize and split the dataset using proportions. A common use case for this is splitting the dataset into train and test sets (equivalent to e.g. scikit-learn's ``train_test_split``). For a higher level abstraction, see :meth:`Dataset.train_test_split`.

(self, proportions: List[float])


@ConsumptionAPI
@PublicAPI(api_group=SMJ_API_GROUP)
def split_proportionately(
    self, proportions: List[float]
) -> List["MaterializedDataset"]:
    """Materialize and split the dataset using proportions.

    A common use case for this is splitting the dataset into train
    and test sets (equivalent to e.g. scikit-learn's ``train_test_split``).
    For a higher level abstraction, see :meth:`Dataset.train_test_split`.

    This method splits datasets so that all splits
    always contain at least one row. If that isn't possible,
    an exception is raised.

    This is equivalent to calculating the indices manually and calling
    :meth:`Dataset.split_at_indices`.

    Examples:
        >>> import ray
        >>> ds = ray.data.range(10)
        >>> d1, d2, d3 = ds.split_proportionately([0.2, 0.5])
        >>> d1.take_batch()
        {'id': array([0, 1])}
        >>> d2.take_batch()
        {'id': array([2, 3, 4, 5, 6])}
        >>> d3.take_batch()
        {'id': array([7, 8, 9])}

    Time complexity: O(num splits)

    Args:
        proportions: List of proportions to split the dataset according to.
            Must sum up to less than 1, and each proportion must be bigger
            than 0.

    Returns:
        The dataset splits.

    .. seealso::

        :meth:`Dataset.split`
            Unlike :meth:`~Dataset.split_proportionately`, which lets you split a
            dataset into different sizes, :meth:`Dataset.split` splits a dataset
            into approximately equal splits.

        :meth:`Dataset.split_at_indices`
            :meth:`Dataset.split_proportionately` uses this method under the hood.

        :meth:`Dataset.streaming_split`
            Unlike :meth:`~Dataset.split`, :meth:`~Dataset.streaming_split`
            doesn't materialize the dataset in memory.
    """

    if len(proportions) < 1:
        raise ValueError("proportions must be at least of length 1")
    if sum(proportions) >= 1:
        raise ValueError("proportions must sum to less than 1")
    if any(p <= 0 for p in proportions):
        raise ValueError("proportions must be bigger than 0")
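
The listing above stops after the argument checks. The docstring says the method is equivalent to calculating the indices manually and calling :meth:`Dataset.split_at_indices`, so the following is a minimal sketch of that calculation in plain Python, not Ray's implementation. The helper names `proportions_to_indices` and `split_list_at_indices` are hypothetical. The rounding rule (floor of cumulative proportion times row count) is an assumption, chosen because it reproduces the docstring example. The sketch simply raises when a split would be empty and does not try to adjust the indices.

```python
from itertools import accumulate
from typing import List, Sequence


def proportions_to_indices(num_rows: int, proportions: Sequence[float]) -> List[int]:
    """Hypothetical helper: turn proportions into cut indices."""
    # Same argument checks as in the listing above.
    if len(proportions) < 1:
        raise ValueError("proportions must be at least of length 1")
    if sum(proportions) >= 1:
        raise ValueError("proportions must sum to less than 1")
    if any(p <= 0 for p in proportions):
        raise ValueError("proportions must be bigger than 0")

    # Assumed rounding rule: floor of (cumulative proportion * row count).
    indices = [int(num_rows * c) for c in accumulate(proportions)]

    # Every split, including the implicit last one, needs at least one row,
    # so the cut points must be strictly increasing and lie inside (0, num_rows).
    bounds = [0, *indices, num_rows]
    if any(lo >= hi for lo, hi in zip(bounds, bounds[1:])):
        raise ValueError("cannot give every split at least one row")
    return indices


def split_list_at_indices(rows: list, indices: Sequence[int]) -> List[list]:
    """Hypothetical list-based stand-in for Dataset.split_at_indices."""
    bounds = [0, *indices, len(rows)]
    return [rows[lo:hi] for lo, hi in zip(bounds, bounds[1:])]


rows = list(range(10))
indices = proportions_to_indices(len(rows), [0.2, 0.5])
splits = split_list_at_indices(rows, indices)
print(indices)  # [2, 7]
print(splits)   # [[0, 1], [2, 3, 4, 5, 6], [7, 8, 9]]
```

Two proportions yield three splits because the proportions must sum to less than 1: the remainder (here 0.3 of the rows) forms the final split.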

Callers 3

train_test_split · Method · 0.80

test_split_proportionately · Function · 0.80

04e_rec_sys_workload_pattern.py · File · 0.80

Calls 4

_try_count_or_materialize · Method · 0.95

split_at_indices · Method · 0.80

range · Function · 0.70

sum · Function · 0.50

Tested by 1

test_split_proportionately · Function · 0.64