MCPcopy
hub / github.com/ray-project/ray / repartition

Method repartition

python/ray/data/dataset.py:1756–1893  ·  view source on GitHub ↗

Repartition the :class:`Dataset` into exactly this number of :ref:`blocks `. This method can be useful to tune the performance of your pipeline. To learn more, see :ref:`Advanced: Performance Tips and Tuning `. If you're writi

(
        self,
        num_blocks: Optional[int] = None,
        target_num_rows_per_block: Optional[int] = None,
        *,
        strict: bool = False,
        shuffle: bool = False,
        keys: Optional[List[str]] = None,
        sort: bool = False,
    )

Source from the content-addressed store, hash-verified

1754
1755 @PublicAPI(api_group=SSR_API_GROUP)
1756 def repartition(
1757 self,
1758 num_blocks: Optional[int] = None,
1759 target_num_rows_per_block: Optional[int] = None,
1760 *,
1761 strict: bool = False,
1762 shuffle: bool = False,
1763 keys: Optional[List[str]] = None,
1764 sort: bool = False,
1765 ) -> "Dataset":
1766 """Repartition the :class:`Dataset` into exactly this number of
1767 :ref:`blocks <dataset_concept>`.
1768
1769 This method can be useful to tune the performance of your pipeline. To learn
1770 more, see :ref:`Advanced: Performance Tips and Tuning <data_performance_tips>`.
1771
1772 If you&#x27;re writing data to files, you can also use this method to change the
1773 number of output files. To learn more, see
1774 :ref:`Changing the number of output files <changing-number-output-files>`.
1775
1776 .. note::
1777
1778 Repartition has three modes:
1779
1780 * When ``num_blocks`` and ``shuffle=True`` are specified Ray Data performs a full distributed shuffle producing exactly ``num_blocks`` blocks.
1781 * When ``num_blocks`` and ``shuffle=False`` are specified, Ray Data does NOT perform full shuffle, instead opting in for splitting and combining of the blocks attempting to minimize the necessary data movement (relative to full-blown shuffle). Exactly ``num_blocks`` will be produced.
1782 * If ``target_num_rows_per_block`` is set (exclusive with ``num_blocks`` and ``shuffle``), streaming repartitioning will be executed, where blocks will be made to carry no more than ``target_num_rows_per_block`` rows. Smaller blocks will be combined into bigger ones up to ``target_num_rows_per_block`` as well.
1783
1784 .. image:: /data/images/dataset-shuffle.svg
1785 :align: center
1786
1787 ..
1788 https://docs.google.com/drawings/d/132jhE3KXZsf29ho1yUdPrCHB9uheHBWHJhDQMXqIVPA/edit
1789
1790 Examples:
1791 >>> import ray
1792 >>> ds = ray.data.range(100).repartition(10).materialize()
1793 >>> ds.num_blocks()
1794 10
1795
1796 Time complexity: O(dataset size / parallelism)
1797
1798 Args:
1799 num_blocks: Number of blocks after repartitioning.
1800 target_num_rows_per_block: [Experimental] The target number of rows per block to
1801 repartition. Performs streaming repartitioning of the dataset (no shuffling).
1802 Note that either `num_blocks` or
1803 `target_num_rows_per_block` must be set, but not both. When
1804 `target_num_rows_per_block` is set, it only repartitions
1805 :class:`Dataset` :ref:`blocks <dataset_concept>` that are larger than
1806 `target_num_rows_per_block`. Note that the system will internally
1807 figure out the number of rows per :ref:`blocks <dataset_concept>` for
1808 optimal execution, based on the `target_num_rows_per_block`. This is
1809 the current behavior because of the implementation and may change in
1810 the future.
1811 strict: If ``True``, ``repartition`` guarantees that all output blocks,
1812 except for the last one, will have exactly ``target_num_rows_per_block`` rows.
1813 If ``False``, ``repartition`` uses best-effort bundling and may produce at most

Calls 4

RepartitionClass · 0.90
LogicalPlanClass · 0.90
_from_parentMethod · 0.80