hub / github.com/ray-project/ray / repartition

Method repartition

python/ray/data/dataset.py:1756–1893

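Repartition the :class:`Dataset` into exactly this number of :ref:`blocks <dataset_concept>`. This method can be useful to tune the performance of your pipeline. To learn more, see :ref:`Advanced: Performance Tips and Tuning <data_performance_tips>`. If you're writing data to files, you can also use this method to change the number of output files.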

(
        self,
        num_blocks: Optional[int] = None,
        target_num_rows_per_block: Optional[int] = None,
        *,
        strict: bool = False,
        shuffle: bool = False,
        keys: Optional[List[str]] = None,
        sort: bool = False,
    )
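
The method's docstring describes three modes and states that exactly one of `num_blocks` or `target_num_rows_per_block` must be set, with `target_num_rows_per_block` also exclusive with `shuffle`. A minimal sketch of that decision logic follows. The function `select_repartition_mode` and the mode names it returns are hypothetical, introduced only for illustration; this is not Ray's implementation, and Ray's actual error types and messages may differ.

```python
from typing import Optional


def select_repartition_mode(
    num_blocks: Optional[int] = None,
    target_num_rows_per_block: Optional[int] = None,
    shuffle: bool = False,
) -> str:
    """Hypothetical helper: map an argument combination to one of the three modes."""
    # Exactly one of the two sizing arguments must be provided.
    if (num_blocks is None) == (target_num_rows_per_block is None):
        raise ValueError(
            "Set exactly one of num_blocks or target_num_rows_per_block"
        )
    if target_num_rows_per_block is not None:
        # The docstring describes this argument as exclusive with shuffle.
        if shuffle:
            raise ValueError(
                "target_num_rows_per_block cannot be combined with shuffle=True"
            )
        return "streaming"
    # num_blocks is set: shuffle picks between the two remaining modes.
    return "full_shuffle" if shuffle else "split_and_combine"


print(select_repartition_mode(num_blocks=10, shuffle=True))
print(select_repartition_mode(num_blocks=10))
print(select_repartition_mode(target_num_rows_per_block=1000))
```

The sketch ignores `strict`, `keys`, and `sort`, whose interaction with the modes is not fully described in the excerpt shown on this page.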

Source from the content-addressed store, hash-verified

    @PublicAPI(api_group=SSR_API_GROUP)
    def repartition(
        self,
        num_blocks: Optional[int] = None,
        target_num_rows_per_block: Optional[int] = None,
        *,
        strict: bool = False,
        shuffle: bool = False,
        keys: Optional[List[str]] = None,
        sort: bool = False,
    ) -> "Dataset":
        """Repartition the :class:`Dataset` into exactly this number of
        :ref:`blocks <dataset_concept>`.

        This method can be useful to tune the performance of your pipeline. To learn
        more, see :ref:`Advanced: Performance Tips and Tuning <data_performance_tips>`.

        If you're writing data to files, you can also use this method to change the
        number of output files. To learn more, see
        :ref:`Changing the number of output files <changing-number-output-files>`.

        .. note::

            Repartition has three modes:

            * When ``num_blocks`` and ``shuffle=True`` are specified Ray Data performs a full distributed shuffle producing exactly ``num_blocks`` blocks.
            * When ``num_blocks`` and ``shuffle=False`` are specified, Ray Data does NOT perform full shuffle, instead opting in for splitting and combining of the blocks attempting to minimize the necessary data movement (relative to full-blown shuffle). Exactly ``num_blocks`` will be produced.
            * If ``target_num_rows_per_block`` is set (exclusive with ``num_blocks`` and ``shuffle``), streaming repartitioning will be executed, where blocks will be made to carry no more than ``target_num_rows_per_block`` rows. Smaller blocks will be combined into bigger ones up to ``target_num_rows_per_block`` as well.

        .. image:: /data/images/dataset-shuffle.svg
            :align: center

        ..
            https://docs.google.com/drawings/d/132jhE3KXZsf29ho1yUdPrCHB9uheHBWHJhDQMXqIVPA/edit

        Examples:
            >>> import ray
            >>> ds = ray.data.range(100).repartition(10).materialize()
            >>> ds.num_blocks()
            10

        Time complexity: O(dataset size / parallelism)

        Args:
            num_blocks: Number of blocks after repartitioning.
            target_num_rows_per_block: [Experimental] The target number of rows per block to
                repartition. Performs streaming repartitioning of the dataset (no shuffling).
                Note that either `num_blocks` or
                `target_num_rows_per_block` must be set, but not both. When
                `target_num_rows_per_block` is set, it only repartitions
                :class:`Dataset` :ref:`blocks <dataset_concept>` that are larger than
                `target_num_rows_per_block`. Note that the system will internally
                figure out the number of rows per :ref:`blocks <dataset_concept>` for
                optimal execution, based on the `target_num_rows_per_block`. This is
                the current behavior because of the implementation and may change in
                the future.
            strict: If ``True``, ``repartition`` guarantees that all output blocks,
                except for the last one, will have exactly ``target_num_rows_per_block`` rows.
                If ``False``, ``repartition`` uses best-effort bundling and may produce at most
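
The `strict` argument above is described as guaranteeing that every output block except the last has exactly `target_num_rows_per_block` rows. The following sketch models that guarantee on row counts alone. The helper `strict_rechunk` is hypothetical and works on a plain list of block sizes; it is an illustration of the stated guarantee under that assumption, not Ray's streaming repartition operator, and it does not model the non-strict behavior, whose description is cut off in this excerpt.

```python
from typing import List


def strict_rechunk(block_sizes: List[int], target_num_rows_per_block: int) -> List[int]:
    """Hypothetical helper: output block sizes under the strict guarantee."""
    if target_num_rows_per_block <= 0:
        raise ValueError("target_num_rows_per_block must be positive")
    total_rows = sum(block_sizes)
    # Every block but the last is filled to exactly the target size.
    full_blocks, remainder = divmod(total_rows, target_num_rows_per_block)
    output = [target_num_rows_per_block] * full_blocks
    # Leftover rows, if any, form one final smaller block.
    if remainder:
        output.append(remainder)
    return output


# Four input blocks holding 117 rows in total, re-chunked to 25 rows per block.
print(strict_rechunk([30, 5, 70, 12], 25))  # [25, 25, 25, 25, 17]
```

The total row count is preserved, and only the final block may fall short of the target.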

Callers 15

map_groups · Method · 0.45

load_checkpoint · Method · 0.45

test_global_tabular_sum · Function · 0.45

test_random_block_order · Function · 0.45

test_random_shuffle · Function · 0.45

test_iter_batches_empty_block · Function · 0.45

test_read_map_batches_operator_fusion_with_random_shuffle_operator · Function · 0.45

test_read_map_batches_operator_fusion_with_repartition_operator · Function · 0.45

test_streaming_repartition_map_batches_fusion_order_and_params · Function · 0.45

test_streaming_repartition_no_further_fuse · Function · 0.45

test_streaming_repartition_multiple_fusion_non_strict · Function · 0.45

test_combine_repartition_aggregate · Function · 0.45

Calls 4

StreamingRepartition · Class · 0.90

Repartition · Class · 0.90

LogicalPlan · Class · 0.90

_from_parent · Method · 0.80

Tested by 15

test_global_tabular_sum · Function · 0.36

test_random_block_order · Function · 0.36

test_random_shuffle · Function · 0.36

test_iter_batches_empty_block · Function · 0.36

test_read_map_batches_operator_fusion_with_random_shuffle_operator · Function · 0.36

test_read_map_batches_operator_fusion_with_repartition_operator · Function · 0.36

test_streaming_repartition_map_batches_fusion_order_and_params · Function · 0.36

test_streaming_repartition_no_further_fuse · Function · 0.36

test_streaming_repartition_multiple_fusion_non_strict · Function · 0.36

test_combine_repartition_aggregate · Function · 0.36

test_combine_streaming_repartition_to_shuffle_repartition · Function · 0.36

test_hash_shuffle · Method · 0.36