hub / github.com/ray-project/ray / random_split

Method random_split

python/ray/data/dataset.py:2856–2878 · view source on GitHub ↗

Perform a random split on a batch: each row goes to train with probability (1 - test_proportion), or to test otherwise. This version ensures that the random choices are **stable per Ray task execution** by seeding the RNG with a combination of a user

(batch: pa.Table)

Source from the content-addressed store, hash-verified

2854	raise ValueError("hash_column is not supported for random split")
2855
2856	def random_split(batch: pa.Table):
2857	"""
2858	Perform a random split on a batch: each row goes to train with probability (1 - test_proportion),
2859	or to test otherwise.
2860
2861	This version ensures that the random choices are stable per Ray task execution by seeding
2862	the RNG with a combination of a user-specified seed and the Ray task ID.
2863	"""
2864	ctx = TaskContext.get_current()
2865	if "train_test_split_rng" in ctx.kwargs:
2866	rng = ctx.kwargs["train_test_split_rng"]
2867	elif seed is None:
2868	rng = np.random.default_rng([ctx.task_idx])
2869	ctx.kwargs["train_test_split_rng"] = rng
2870	else:
2871	rng = np.random.default_rng([ctx.task_idx, seed])
2872	ctx.kwargs["train_test_split_rng"] = rng
2873
2874	# Draw Bernoulli samples: 1 = train, 0 = test
2875	is_train = rng.random(batch.num_rows) < (1 - test_size)
2876	return batch.append_column(
2877	_TRAIN_TEST_SPLIT_COLUMN, pa.array(is_train, type=pa.bool_())
2878	)
2879
2880	def hash_split(batch: pa.Table) -> tuple[pa.Table, pa.Table]:
2881	def key_to_bucket(key: Any) -> int:

Callers

nothing calls this directly

Calls 1

get_currentMethod · 0.45

Tested by

no test coverage detected