Function _unique_internal

dask/array/routines.py:1649–1707

Helper/wrapper function for :func:`numpy.unique`. Uses :func:`numpy.unique` to find the unique values for the array chunk. Given this chunk may not represent the whole array, also take the ``indices`` and ``counts`` that are in 1-to-1 correspondence to ``ar`` and reduce them in the same fashion as ``ar`` is reduced.

(ar, indices, counts, return_inverse=False)
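The signature doubles as a set of switches: judging from the source, passing ``None`` for ``indices`` or ``counts`` drops that field from the structured result, and ``return_inverse`` adds an ``inverse`` field. A minimal sketch of that mapping follows; ``result_dtype`` is a hypothetical helper written for illustration, not part of Dask.

```python
import numpy as np


def result_dtype(value_dtype, indices=None, counts=None, return_inverse=False):
    """Hypothetical helper: build the structured dtype the result would use."""
    fields = [("values", value_dtype)]
    if indices is not None:
        fields.append(("indices", np.intp))
    if return_inverse:
        fields.append(("inverse", np.intp))
    if counts is not None:
        fields.append(("counts", np.intp))
    return np.dtype(fields)


chunk = np.array([3, 1, 3, 2])

# Neither indices nor counts supplied: only the unique values are kept.
only_values = result_dtype(chunk.dtype)

# Everything supplied: all four fields appear, in the order the source adds them.
everything = result_dtype(
    chunk.dtype,
    indices=np.arange(chunk.size),
    counts=np.ones(chunk.size, dtype=np.intp),
    return_inverse=True,
)

print(only_values.names)
print(everything.names)
```

The field order (values, indices, inverse, counts) follows the order of the ``dt.append`` calls in the listing.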


def _unique_internal(ar, indices, counts, return_inverse=False):
    """
    Helper/wrapper function for :func:`numpy.unique`.

    Uses :func:`numpy.unique` to find the unique values for the array chunk.
    Given this chunk may not represent the whole array, also take the
    ``indices`` and ``counts`` that are in 1-to-1 correspondence to ``ar``
    and reduce them in the same fashion as ``ar`` is reduced. Namely sum
    any counts that correspond to the same value and take the smallest
    index that corresponds to the same value.

    To handle the inverse mapping from the unique values to the original
    array, simply return a NumPy array created with ``arange`` with enough
    values to correspond 1-to-1 to the unique values. While there is more
    work needed to be done to create the full inverse mapping for the
    original array, this provides enough information to generate the
    inverse mapping in Dask.

    Given Dask likes to have one array returned from functions like
    ``blockwise``, some formatting is done to stuff all of the resulting arrays
    into one big NumPy structured array. Dask is then able to handle this
    object and can split it apart into the separate results on the Dask side,
    which then can be passed back to this function in concatenated chunks for
    further reduction or can be returned to the user to perform other forms of
    analysis.

    By handling the problem in this way, it does not matter where a chunk
    is in a larger array or how big it is. The chunk can still be computed
    in the same way. Also it does not matter if the chunk is the result of
    other chunks being run through this function multiple times. The end
    result will still be just as accurate using this strategy.
    """

    return_index = indices is not None
    return_counts = counts is not None

    u = np.unique(ar)

    dt = [("values", u.dtype)]
    if return_index:
        dt.append(("indices", np.intp))
    if return_inverse:
        dt.append(("inverse", np.intp))
    if return_counts:
        dt.append(("counts", np.intp))

    r = np.empty(u.shape, dtype=dt)
    r["values"] = u
    if return_inverse:
        r["inverse"] = np.arange(len(r), dtype=np.intp)
    if return_index or return_counts:
        for i, v in enumerate(r["values"]):
            m = ar == v
            if return_index:
                indices[m].min(keepdims=True, out=r["indices"][i : i + 1])
            if return_counts:
                counts[m].sum(keepdims=True, out=r["counts"][i : i + 1])

    return r
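The docstring's claim that partial results can be fed back through the same reduction can be checked with a small experiment. The sketch below uses ``reduce_chunk``, a simplified stand-in written for this example (it always tracks indices and counts and needs only NumPy); it is not the Dask function itself.

```python
import numpy as np


def reduce_chunk(ar, indices, counts):
    """Simplified stand-in for the listed function (illustration only)."""
    u = np.unique(ar)
    dt = [("values", u.dtype), ("indices", np.intp), ("counts", np.intp)]
    r = np.empty(u.shape, dtype=dt)
    r["values"] = u
    for i, v in enumerate(u):
        m = ar == v
        r["indices"][i] = indices[m].min()  # smallest index seen for this value
        r["counts"][i] = counts[m].sum()    # total occurrences of this value
    return r


x = np.array([4, 2, 4, 7, 2, 2, 9, 4])
positions = np.arange(x.size, dtype=np.intp)  # global position of each element
ones = np.ones(x.size, dtype=np.intp)         # each element counts once

# First pass: reduce two chunks independently.
chunks = (slice(0, 4), slice(4, 8))
parts = [reduce_chunk(x[s], positions[s], ones[s]) for s in chunks]

# Second pass: concatenate the partial results and reduce them again.
merged = np.concatenate(parts)
final = reduce_chunk(merged["values"], merged["indices"], merged["counts"])

# Reference answer computed on the whole array at once.
expected_values, expected_index, expected_counts = np.unique(
    x, return_index=True, return_counts=True
)
print(final)
```

Because minimum and sum are both associative, the two-pass result agrees with ``np.unique`` applied to the whole array, which is the property the docstring relies on.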

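The ``arange`` placeholder in the ``inverse`` field only labels each unique value; the full inverse mapping still has to be built against the original data. One way this could be done is sketched below. It is an assumption for illustration and not necessarily the approach Dask takes.

```python
import numpy as np

x = np.array([4, 2, 4, 7, 2])
values = np.unique(x)
labels = np.arange(len(values), dtype=np.intp)  # the ``arange`` placeholder

# Compare every element against every unique value. Exactly one column
# matches per row, so a label-weighted sum picks out that column's label.
matches = x[:, None] == values[None, :]
inverse = (matches * labels).sum(axis=1)

print(inverse)
```

Indexing ``values`` with ``inverse`` reproduces ``x``, which is the defining property of the inverse mapping returned by ``numpy.unique``.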
Callers

nothing calls this directly

Calls 5

unique · Method · 0.45

empty · Method · 0.45

arange · Method · 0.45

min · Method · 0.45

sum · Method · 0.45

Tested by

no test coverage detected
