MCPcopy Index your code
hub / github.com/dask/dask / process_val_weights

Function process_val_weights

dask/dataframe/partitionquantiles.py:303–395  ·  view source on GitHub ↗

Calculate final approximate percentiles given weighted vals ``vals_and_weights`` is assumed to be sorted. We take a cumulative sum of the weights, which makes them percentile-like (their scale is [0, N] instead of [0, 100]). Next we find the divisions to create partitions of appro

(vals_and_weights, npartitions, dtype_info)

Source from the content-addressed store, hash-verified

301
302
303def process_val_weights(vals_and_weights, npartitions, dtype_info):
304 """Calculate final approximate percentiles given weighted vals
305
306 ``vals_and_weights`` is assumed to be sorted. We take a cumulative
307 sum of the weights, which makes them percentile-like (their scale is
308 [0, N] instead of [0, 100]). Next we find the divisions to create
309 partitions of approximately equal size.
310
311 It is possible for adjacent values of the result to be the same. Since
312 these determine the divisions of the new partitions, some partitions
313 may be empty. This can happen if we under-sample the data, or if there
314 aren't enough unique values in the column. Increasing ``upsample``
315 keyword argument in ``df.set_index`` may help.
316 """
317 dtype, info = dtype_info
318
319 if not vals_and_weights:
320 try:
321 return np.array(None, dtype=dtype)
322 except Exception:
323 # dtype does not support None value so allow it to change
324 return np.array(None, dtype=np.float64)
325
326 vals, weights = vals_and_weights
327 vals = np.array(vals)
328 weights = np.array(weights)
329
330 # We want to create exactly `npartition` number of groups of `vals` that
331 # are approximately the same weight and non-empty if possible. We use a
332 # simple approach (more accurate algorithms exist):
333 # 1. Remove all the values with weights larger than the relative
334 # percentile width from consideration (these are `jumbo`s)
335 # 2. Calculate percentiles with "interpolation=left" of percentile-like
336 # weights of the remaining values. These are guaranteed to be unique.
337 # 3. Concatenate the values from (1) and (2), sort, and return.
338 #
339 # We assume that all values are unique, which happens in the previous
340 # step `merge_and_compress_summaries`.
341
342 if len(vals) == npartitions + 1:
343 rv = vals
344 elif len(vals) < npartitions + 1:
345 # The data is under-sampled
346 if np.issubdtype(vals.dtype, np.number) and not isinstance(
347 dtype, pd.CategoricalDtype
348 ):
349 # Interpolate extra divisions
350 q_weights = np.cumsum(weights)
351 q_target = np.linspace(q_weights[0], q_weights[-1], npartitions + 1)
352 rv = np.interp(q_target, q_weights, vals)
353 else:
354 # Distribute the empty partitions
355 duplicated_index = np.linspace(
356 0, len(vals) - 1, npartitions - len(vals) + 1, dtype=int
357 )
358 duplicated_vals = vals[duplicated_index]
359 rv = np.concatenate([vals, duplicated_vals])
360 rv.sort()

Callers

nothing calls this directly

Calls 2

cumsumMethod · 0.45
sumMethod · 0.45

Tested by

no test coverage detected

Used in the wild real call sites across dependent graphs

searching dependent graphs…