hub / github.com/dask/dask / process_val_weights

Function process_val_weights

dask/dataframe/partitionquantiles.py:303–395 · view source on GitHub ↗

Calculate final approximate percentiles given weighted vals ``vals_and_weights`` is assumed to be sorted. We take a cumulative sum of the weights, which makes them percentile-like (their scale is [0, N] instead of [0, 100]). Next we find the divisions to create partitions of appro

(vals_and_weights, npartitions, dtype_info)

Source from the content-addressed store, hash-verified

301
302
303	def process_val_weights(vals_and_weights, npartitions, dtype_info):
304	"""Calculate final approximate percentiles given weighted vals
305
306	``vals_and_weights`` is assumed to be sorted. We take a cumulative
307	sum of the weights, which makes them percentile-like (their scale is
308	[0, N] instead of [0, 100]). Next we find the divisions to create
309	partitions of approximately equal size.
310
311	It is possible for adjacent values of the result to be the same. Since
312	these determine the divisions of the new partitions, some partitions
313	may be empty. This can happen if we under-sample the data, or if there
314	aren't enough unique values in the column. Increasing ``upsample``
315	keyword argument in ``df.set_index`` may help.
316	"""
317	dtype, info = dtype_info
318
319	if not vals_and_weights:
320	try:
321	return np.array(None, dtype=dtype)
322	except Exception:
323	# dtype does not support None value so allow it to change
324	return np.array(None, dtype=np.float64)
325
326	vals, weights = vals_and_weights
327	vals = np.array(vals)
328	weights = np.array(weights)
329
330	# We want to create exactly `npartition` number of groups of `vals` that
331	# are approximately the same weight and non-empty if possible. We use a
332	# simple approach (more accurate algorithms exist):
333	# 1. Remove all the values with weights larger than the relative
334	# percentile width from consideration (these are `jumbo`s)
335	# 2. Calculate percentiles with "interpolation=left" of percentile-like
336	# weights of the remaining values. These are guaranteed to be unique.
337	# 3. Concatenate the values from (1) and (2), sort, and return.
338	#
339	# We assume that all values are unique, which happens in the previous
340	# step `merge_and_compress_summaries`.
341
342	if len(vals) == npartitions + 1:
343	rv = vals
344	elif len(vals) < npartitions + 1:
345	# The data is under-sampled
346	if np.issubdtype(vals.dtype, np.number) and not isinstance(
347	dtype, pd.CategoricalDtype
348	):
349	# Interpolate extra divisions
350	q_weights = np.cumsum(weights)
351	q_target = np.linspace(q_weights[0], q_weights[-1], npartitions + 1)
352	rv = np.interp(q_target, q_weights, vals)
353	else:
354	# Distribute the empty partitions
355	duplicated_index = np.linspace(
356	0, len(vals) - 1, npartitions - len(vals) + 1, dtype=int
357	)
358	duplicated_vals = vals[duplicated_index]
359	rv = np.concatenate([vals, duplicated_vals])
360	rv.sort()

Callers

nothing calls this directly

Calls 2

cumsumMethod · 0.45

sumMethod · 0.45

Tested by

no test coverage detected

Used in the wild real call sites across dependent graphs

searching dependent graphs…