
Method map_groups

python/ray/data/grouped_data.py:95–309

Apply the given function to each group of records of this dataset.

While map_groups() is very flexible, note that it comes with downsides:

* It may be slower than using more specific methods such as min(), max().
* It requires that each group fits in memory on a single node.
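The trade-off described above can be illustrated with a minimal plain-Python sketch. This is a conceptual model only, not Ray's implementation: `map_groups_sketch` and `running_max` are hypothetical helpers, and records are modeled as a list of dicts rather than Ray blocks. The first helper materializes each whole group before calling the function, while the second keeps only one value per key, which is the kind of work a dedicated aggregation can do.

```python
from itertools import groupby
from operator import itemgetter


def map_groups_sketch(records, key, fn):
    """Hypothetical model: sort by key, then call fn once per whole group."""
    out = []
    ordered = sorted(records, key=itemgetter(key))
    for _, group in groupby(ordered, key=itemgetter(key)):
        # The entire group is built as a list before fn sees it, which
        # mirrors the "each group must fit in memory" constraint.
        out.extend(fn(list(group)))
    return out


def running_max(records, key, column):
    """Hypothetical aggregation-style alternative: one pass, one value per key."""
    best = {}
    for row in records:
        k = row[key]
        if k not in best or row[column] > best[k]:
            best[k] = row[column]
    return best


records = [
    {"group": 1, "value": 1},
    {"group": 1, "value": 2},
    {"group": 2, "value": 3},
    {"group": 2, "value": 4},
]

# First value per group, using the same data as the method's docstring example.
first_per_group = map_groups_sketch(
    records, "group", lambda g: [{"result": g[0]["value"]}]
)
max_per_group = running_max(records, "group", "value")
```

In this sketch the per-group function is free to return any number of records, which is where the flexibility comes from; the running maximum never holds more than one value per key.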

(
        self,
        fn: UserDefinedFunction[DataBatch, DataBatch],
        *,
        zero_copy_batch: bool = True,
        compute: Union[str, ComputeStrategy] = None,
        batch_format: Optional[str] = "default",
        fn_args: Optional[Iterable[Any]] = None,
        fn_kwargs: Optional[Dict[str, Any]] = None,
        fn_constructor_args: Optional[Iterable[Any]] = None,
        fn_constructor_kwargs: Optional[Dict[str, Any]] = None,
        num_cpus: Optional[float] = None,
        num_gpus: Optional[float] = None,
        memory: Optional[float] = None,
        concurrency: Optional[Union[int, Tuple[int, int], Tuple[int, int, int]]] = None,
        ray_remote_args_fn: Optional[Callable[[], Dict[str, Any]]] = None,
        **ray_remote_args,
    )
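The `fn_args` and `fn_kwargs` parameters in the signature above can be pictured with a small sketch. It assumes the extra arguments are forwarded to `fn` after the group batch, as the parameter names suggest. `call_udf_on_group` and `scale_values` are hypothetical names introduced for illustration, and the batch is modeled as a plain dict of column lists rather than a real Ray batch.

```python
def call_udf_on_group(fn, group, fn_args=None, fn_kwargs=None):
    """Hypothetical helper: pass the group first, then any extra arguments."""
    args = tuple(fn_args or ())
    kwargs = dict(fn_kwargs or {})
    return fn(group, *args, **kwargs)


def scale_values(group, factor, *, column="value"):
    # group is a dict of column name -> list, loosely imitating a batch.
    return {column: [v * factor for v in group[column]]}


scaled = call_udf_on_group(
    scale_values,
    {"value": [1, 2]},
    fn_args=[10],
    fn_kwargs={"column": "value"},
)
```

Keeping the extra arguments separate from the function lets the same function be reused with different settings without wrapping it in a lambda.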

Source from the content-addressed store, hash-verified

@PublicAPI(api_group=FA_API_GROUP)
def map_groups(
    self,
    fn: UserDefinedFunction[DataBatch, DataBatch],
    *,
    zero_copy_batch: bool = True,
    compute: Union[str, ComputeStrategy] = None,
    batch_format: Optional[str] = "default",
    fn_args: Optional[Iterable[Any]] = None,
    fn_kwargs: Optional[Dict[str, Any]] = None,
    fn_constructor_args: Optional[Iterable[Any]] = None,
    fn_constructor_kwargs: Optional[Dict[str, Any]] = None,
    num_cpus: Optional[float] = None,
    num_gpus: Optional[float] = None,
    memory: Optional[float] = None,
    concurrency: Optional[Union[int, Tuple[int, int], Tuple[int, int, int]]] = None,
    ray_remote_args_fn: Optional[Callable[[], Dict[str, Any]]] = None,
    **ray_remote_args,
) -> "Dataset":
    """Apply the given function to each group of records of this dataset.

    While map_groups() is very flexible, note that it comes with downsides:

    * It may be slower than using more specific methods such as min(), max().
    * It requires that each group fits in memory on a single node.

    In general, prefer to use `aggregate()` instead of `map_groups()`.

    .. warning::
        Specifying both ``num_cpus`` and ``num_gpus`` for map tasks is experimental,
        and may result in scheduling or stability issues. Please
        `report any issues <https://github.com/ray-project/ray/issues/new/choose>`_
        to the Ray team.

    Examples:
        >>> # Return a single record per group (list of multiple records in,
        >>> # list of a single record out).
        >>> import ray
        >>> import pandas as pd
        >>> import numpy as np
        >>> # Get first value per group.
        >>> ds = ray.data.from_items([ # doctest: +SKIP
        ...     {"group": 1, "value": 1},
        ...     {"group": 1, "value": 2},
        ...     {"group": 2, "value": 3},
        ...     {"group": 2, "value": 4}])
        >>> ds.groupby("group").map_groups( # doctest: +SKIP
        ...     lambda g: {"result": np.array([g["value"][0]])})

        >>> # Return multiple records per group (dataframe in, dataframe out).
        >>> df = pd.DataFrame(
        ...     {"A": ["a", "a", "b"], "B": [1, 1, 3], "C": [4, 6, 5]}
        ... )
        >>> ds = ray.data.from_pandas(df) # doctest: +SKIP
        >>> grouped = ds.groupby("A") # doctest: +SKIP
        >>> grouped.map_groups( # doctest: +SKIP
        ...     lambda g: g.apply(
        ...         lambda c: c / g[c.name].sum() if c.name in ["B", "C"] else c
        ...     )

Callers 15

with_column · Method · 0.95

_stratified_train_test_split · Method · 0.80

test_strict_convert_map_groups · Function · 0.80

test_does_not_pushdown_limit_past_map_groups_by_default · Function · 0.80

test_shuffle_diagnostics.py · File · 0.80

test_map_groups_with_gpus · Function · 0.80

test_map_groups_with_actors · Function · 0.80

test_map_groups_with_actors_and_args · Function · 0.80

test_groupby_large_udf_returns · Function · 0.80

test_groupby_map_groups_for_none_groupkey · Function · 0.80

test_groupby_map_groups_perf · Function · 0.80

test_groupby_map_groups_for_pandas · Function · 0.80

Calls 3

_map_batches_without_batch_size_validation · Method · 0.80

repartition · Method · 0.45

sort · Method · 0.45

Tested by 15

test_strict_convert_map_groups · Function · 0.64

test_does_not_pushdown_limit_past_map_groups_by_default · Function · 0.64

test_map_groups_with_gpus · Function · 0.64

test_map_groups_with_actors · Function · 0.64

test_map_groups_with_actors_and_args · Function · 0.64

test_groupby_large_udf_returns · Function · 0.64

test_groupby_map_groups_for_none_groupkey · Function · 0.64

test_groupby_map_groups_perf · Function · 0.64

test_groupby_map_groups_for_pandas · Function · 0.64

test_groupby_map_groups_for_arrow · Function · 0.64

test_groupby_map_groups_for_numpy · Function · 0.64

test_groupby_map_groups_with_different_types · Function · 0.64