Calculate final approximate percentiles given weighted vals ``vals_and_weights`` is assumed to be sorted. We take a cumulative sum of the weights, which makes them percentile-like (their scale is [0, N] instead of [0, 100]). Next we find the divisions to create partitions of appro
(vals_and_weights, npartitions, dtype_info)
| 301 | |
| 302 | |
| 303 | def process_val_weights(vals_and_weights, npartitions, dtype_info): |
| 304 | """Calculate final approximate percentiles given weighted vals |
| 305 | |
| 306 | ``vals_and_weights`` is assumed to be sorted. We take a cumulative |
| 307 | sum of the weights, which makes them percentile-like (their scale is |
| 308 | [0, N] instead of [0, 100]). Next we find the divisions to create |
| 309 | partitions of approximately equal size. |
| 310 | |
| 311 | It is possible for adjacent values of the result to be the same. Since |
| 312 | these determine the divisions of the new partitions, some partitions |
| 313 | may be empty. This can happen if we under-sample the data, or if there |
| 314 | aren't enough unique values in the column. Increasing ``upsample`` |
| 315 | keyword argument in ``df.set_index`` may help. |
| 316 | """ |
| 317 | dtype, info = dtype_info |
| 318 | |
| 319 | if not vals_and_weights: |
| 320 | try: |
| 321 | return np.array(None, dtype=dtype) |
| 322 | except Exception: |
| 323 | # dtype does not support None value so allow it to change |
| 324 | return np.array(None, dtype=np.float64) |
| 325 | |
| 326 | vals, weights = vals_and_weights |
| 327 | vals = np.array(vals) |
| 328 | weights = np.array(weights) |
| 329 | |
| 330 | # We want to create exactly `npartition` number of groups of `vals` that |
| 331 | # are approximately the same weight and non-empty if possible. We use a |
| 332 | # simple approach (more accurate algorithms exist): |
| 333 | # 1. Remove all the values with weights larger than the relative |
| 334 | # percentile width from consideration (these are `jumbo`s) |
| 335 | # 2. Calculate percentiles with "interpolation=left" of percentile-like |
| 336 | # weights of the remaining values. These are guaranteed to be unique. |
| 337 | # 3. Concatenate the values from (1) and (2), sort, and return. |
| 338 | # |
| 339 | # We assume that all values are unique, which happens in the previous |
| 340 | # step `merge_and_compress_summaries`. |
| 341 | |
| 342 | if len(vals) == npartitions + 1: |
| 343 | rv = vals |
| 344 | elif len(vals) < npartitions + 1: |
| 345 | # The data is under-sampled |
| 346 | if np.issubdtype(vals.dtype, np.number) and not isinstance( |
| 347 | dtype, pd.CategoricalDtype |
| 348 | ): |
| 349 | # Interpolate extra divisions |
| 350 | q_weights = np.cumsum(weights) |
| 351 | q_target = np.linspace(q_weights[0], q_weights[-1], npartitions + 1) |
| 352 | rv = np.interp(q_target, q_weights, vals) |
| 353 | else: |
| 354 | # Distribute the empty partitions |
| 355 | duplicated_index = np.linspace( |
| 356 | 0, len(vals) - 1, npartitions - len(vals) + 1, dtype=int |
| 357 | ) |
| 358 | duplicated_vals = vals[duplicated_index] |
| 359 | rv = np.concatenate([vals, duplicated_vals]) |
| 360 | rv.sort() |