def _unique_internal(ar, indices, counts, return_inverse=False):
    """
    Helper/wrapper function for :func:`numpy.unique`.

    Uses :func:`numpy.unique` to find the unique values for the array chunk.
    Given this chunk may not represent the whole array, also take the
    ``indices`` and ``counts`` that are in 1-to-1 correspondence to ``ar``
    and reduce them in the same fashion as ``ar`` is reduced. Namely, sum
    any counts that correspond to the same value and take the smallest
    index that corresponds to the same value.

    To handle the inverse mapping from the unique values to the original
    array, simply return a NumPy array created with ``arange`` with enough
    values to correspond 1-to-1 to the unique values. While more work is
    needed to create the full inverse mapping for the original array, this
    provides enough information to generate the inverse mapping in Dask.

    Given Dask likes to have one array returned from functions like
    ``blockwise``, some formatting is done to stuff all of the resulting arrays
    into one big NumPy structured array. Dask is then able to handle this
    object and can split it apart into the separate results on the Dask side,
    which then can be passed back to this function in concatenated chunks for
    further reduction or can be returned to the user to perform other forms of
    analysis.

    By handling the problem in this way, it does not matter where a chunk
    is in a larger array or how big it is. The chunk can still be computed
    in the same way. Also, it does not matter if the chunk is the result of
    other chunks being run through this function multiple times. The end
    result will still be just as accurate using this strategy.
    """

    # Optional outputs are requested by passing the matching input array.
    return_index = indices is not None
    return_counts = counts is not None

    u = np.unique(ar)

    # Build the structured dtype with one field per requested output.
    dt = [("values", u.dtype)]
    if return_index:
        dt.append(("indices", np.intp))
    if return_inverse:
        dt.append(("inverse", np.intp))
    if return_counts:
        dt.append(("counts", np.intp))

    r = np.empty(u.shape, dtype=dt)
    r["values"] = u
    if return_inverse:
        # Placeholder positions, 1-to-1 with the unique values.
        r["inverse"] = np.arange(len(r), dtype=np.intp)
    if return_index or return_counts:
        for i, v in enumerate(r["values"]):
            # Mask of the chunk entries equal to the current unique value.
            m = ar == v
            if return_index:
                # Smallest index seen for this value, written in place.
                indices[m].min(keepdims=True, out=r["indices"][i : i + 1])
            if return_counts:
                # Total count for this value, written in place.
                counts[m].sum(keepdims=True, out=r["counts"][i : i + 1])
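The reduction described in the docstring can be illustrated without Dask. The following is a minimal sketch, not the library's implementation: `reduce_chunk` is a hypothetical stand-in written for this example, and the sample array and the chunk boundaries are assumptions. It reduces two chunks separately, concatenates the partial results, reduces them again, and compares the outcome with a single call to `numpy.unique` on the whole array.

```python
import numpy as np


def reduce_chunk(ar, indices, counts):
    """Hypothetical stand-in for the reduction described in the docstring."""
    u = np.unique(ar)
    dt = [("values", u.dtype), ("indices", np.intp), ("counts", np.intp)]
    r = np.empty(u.shape, dtype=dt)
    r["values"] = u
    for i, v in enumerate(u):
        m = ar == v
        r["indices"][i] = indices[m].min()  # smallest index for this value
        r["counts"][i] = counts[m].sum()    # summed count for this value
    return r


x = np.array([3, 1, 3, 2, 1, 3])

# First pass: reduce each chunk on its own. The indices are positions in
# the full array, and every element starts with a count of one.
chunks = [(x[:3], np.arange(0, 3)), (x[3:], np.arange(3, 6))]
partials = [reduce_chunk(a, idx, np.ones_like(idx)) for a, idx in chunks]

# Second pass: concatenate the partial results and reduce them again.
p = np.concatenate(partials)
final = reduce_chunk(p["values"], p["indices"], p["counts"])

# Reference result computed on the whole array in one call.
values, first_index, totals = np.unique(x, return_index=True, return_counts=True)
assert np.array_equal(final["values"], values)
assert np.array_equal(final["indices"], first_index)
assert np.array_equal(final["counts"], totals)
```

Because the second pass takes the output fields of the first pass as its inputs, the same function serves both as the per-chunk step and as the combining step, which is the property the last paragraph of the docstring relies on.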