hub / github.com/huggingface/datasets / approximate_mode

Function approximate_mode

src/datasets/utils/stratify.py:4–51

Computes the approximate mode of a multivariate hypergeometric distribution. This is an approximation to the mode of the multivariate hypergeometric given by class_counts and n_draws, and it shouldn't be off by more than one. It is the most likely outcome of drawing n_draws many samples from the population given by class_counts.
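To make the "mode" in the description concrete, the sketch below brute-forces the exact mode of a small multivariate hypergeometric using only the standard library. The helper name `exact_mode` and the example counts are illustrative assumptions and are not part of the datasets library; the enumeration is only practical for tiny inputs.

```python
from itertools import product
from math import comb


def exact_mode(class_counts, n_draws):
    """Brute-force the most likely per-class draw counts (hypothetical helper)."""
    best, best_weight = None, -1
    # Enumerate every per-class count vector that fits within the class sizes.
    for draw in product(*(range(c + 1) for c in class_counts)):
        if sum(draw) != n_draws:
            continue
        # Unnormalized probability: the product of binomial coefficients.
        # The normalizing constant is the same for every outcome, so it is skipped.
        weight = 1
        for c, k in zip(class_counts, draw):
            weight *= comb(c, k)
        if weight > best_weight:
            best, best_weight = draw, weight
    return best


print(exact_mode([5, 3, 2], 4))  # (2, 1, 1)
```

For this input the proportional shares are 2, 1.2 and 0.8. Flooring gives 2, 1 and 0, and the one remaining draw goes to the class with the largest remainder (the third), which yields the same (2, 1, 1).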

(class_counts, n_draws, rng)

Source from the content-addressed store, hash-verified

import numpy as np


def approximate_mode(class_counts, n_draws, rng):
    """Computes approximate mode of multivariate hypergeometric.

    This is an approximation to the mode of the multivariate
    hypergeometric given by class_counts and n_draws.
    It shouldn't be off by more than one.
    It is the most likely outcome of drawing n_draws many
    samples from the population given by class_counts.

    Args
    ----
    class_counts : ndarray of int
        Population per class.
    n_draws : int
        Number of draws (samples to draw) from the overall population.
    rng : random state
        Used to break ties.

    Returns
    -------
    sampled_classes : ndarray of int
        Number of samples drawn from each class.
        np.sum(sampled_classes) == n_draws
    """
    # this computes a bad approximation to the mode of the
    # multivariate hypergeometric given by class_counts and n_draws
    continuous = n_draws * class_counts / class_counts.sum()
    # floored means we don't overshoot n_draws, but probably undershoot
    floored = np.floor(continuous)
    # we add samples according to how much "left over" probability
    # they had, until we arrive at n_draws
    need_to_add = int(n_draws - floored.sum())
    if need_to_add > 0:
        remainder = continuous - floored
        values = np.sort(np.unique(remainder))[::-1]
        # add according to remainder, but break ties
        # randomly to avoid biases
        for value in values:
            (inds,) = np.where(remainder == value)
            # if we need to add fewer than what's in inds,
            # we draw randomly from them.
            # if we need to add more, we add them all and
            # go to the next value
            add_now = min(len(inds), need_to_add)
            inds = rng.choice(inds, size=add_now, replace=False)
            floored[inds] += 1
            need_to_add -= add_now
            if need_to_add == 0:
                break
    return floored.astype(np.int64)


def stratified_shuffle_split_generate_indices(y, n_train, n_test, rng, n_splits=10):
    ...  # body not shown in this listing
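The allocation strategy in the listing above (floor the proportional shares, then hand out the leftover draws by largest remainder with random tie-breaking) can be sketched without NumPy. This is a minimal illustration under stated assumptions, not the library's implementation: the function name `largest_remainder_allocation` is hypothetical, it uses integer arithmetic instead of floats, and it takes a standard-library `random.Random` where the original expects a NumPy random state.

```python
import random


def largest_remainder_allocation(class_counts, n_draws, rng):
    """Pure-Python sketch of proportional allocation with random tie-breaking."""
    total = sum(class_counts)
    # Integer arithmetic: the quotient is the floored share,
    # the remainder ranks the classes for the leftover draws.
    floored = [n_draws * c // total for c in class_counts]
    remainders = [n_draws * c % total for c in class_counts]
    need_to_add = n_draws - sum(floored)
    # Shuffle first so that classes with equal remainders are in random order,
    # then sort by remainder; the sort is stable, so ties keep the shuffled order.
    order = list(range(len(class_counts)))
    rng.shuffle(order)
    order.sort(key=lambda i: remainders[i], reverse=True)
    for i in order[:need_to_add]:
        floored[i] += 1
    return floored


rng = random.Random(0)  # assumption: stand-in for the NumPy random state
print(largest_remainder_allocation([50, 30, 20], 7, rng))  # [4, 2, 1]
# Both classes tie here, so the extra draw goes to either one at random.
print(largest_remainder_allocation([10, 10], 3, rng))
```

In the first call the shares are 3.5, 2.1 and 1.4, so the single leftover draw goes to the first class and no tie-breaking is needed. In the second call the remainders are equal, which is the case the `rng` argument exists for.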

Callers 1

stratified_shuffle_split_generate_indices · Function · 0.85

Calls 2

sort · Method · 0.45

unique · Method · 0.45

Tested by

no test coverage detected