Repartition the :class:`Dataset` into exactly this number of :ref:`blocks `. This method can be useful to tune the performance of your pipeline. To learn more, see :ref:`Advanced: Performance Tips and Tuning `. If you're writi
(
self,
num_blocks: Optional[int] = None,
target_num_rows_per_block: Optional[int] = None,
*,
strict: bool = False,
shuffle: bool = False,
keys: Optional[List[str]] = None,
sort: bool = False,
)
| 1754 | |
| 1755 | @PublicAPI(api_group=SSR_API_GROUP) |
| 1756 | def repartition( |
| 1757 | self, |
| 1758 | num_blocks: Optional[int] = None, |
| 1759 | target_num_rows_per_block: Optional[int] = None, |
| 1760 | *, |
| 1761 | strict: bool = False, |
| 1762 | shuffle: bool = False, |
| 1763 | keys: Optional[List[str]] = None, |
| 1764 | sort: bool = False, |
| 1765 | ) -> "Dataset": |
| 1766 | """Repartition the :class:`Dataset` into exactly this number of |
| 1767 | :ref:`blocks <dataset_concept>`. |
| 1768 | |
| 1769 | This method can be useful to tune the performance of your pipeline. To learn |
| 1770 | more, see :ref:`Advanced: Performance Tips and Tuning <data_performance_tips>`. |
| 1771 | |
| 1772 | If you're writing data to files, you can also use this method to change the |
| 1773 | number of output files. To learn more, see |
| 1774 | :ref:`Changing the number of output files <changing-number-output-files>`. |
| 1775 | |
| 1776 | .. note:: |
| 1777 | |
| 1778 | Repartition has three modes: |
| 1779 | |
| 1780 | * When ``num_blocks`` and ``shuffle=True`` are specified Ray Data performs a full distributed shuffle producing exactly ``num_blocks`` blocks. |
| 1781 | * When ``num_blocks`` and ``shuffle=False`` are specified, Ray Data does NOT perform full shuffle, instead opting in for splitting and combining of the blocks attempting to minimize the necessary data movement (relative to full-blown shuffle). Exactly ``num_blocks`` will be produced. |
| 1782 | * If ``target_num_rows_per_block`` is set (exclusive with ``num_blocks`` and ``shuffle``), streaming repartitioning will be executed, where blocks will be made to carry no more than ``target_num_rows_per_block`` rows. Smaller blocks will be combined into bigger ones up to ``target_num_rows_per_block`` as well. |
| 1783 | |
| 1784 | .. image:: /data/images/dataset-shuffle.svg |
| 1785 | :align: center |
| 1786 | |
| 1787 | .. |
| 1788 | https://docs.google.com/drawings/d/132jhE3KXZsf29ho1yUdPrCHB9uheHBWHJhDQMXqIVPA/edit |
| 1789 | |
| 1790 | Examples: |
| 1791 | >>> import ray |
| 1792 | >>> ds = ray.data.range(100).repartition(10).materialize() |
| 1793 | >>> ds.num_blocks() |
| 1794 | 10 |
| 1795 | |
| 1796 | Time complexity: O(dataset size / parallelism) |
| 1797 | |
| 1798 | Args: |
| 1799 | num_blocks: Number of blocks after repartitioning. |
| 1800 | target_num_rows_per_block: [Experimental] The target number of rows per block to |
| 1801 | repartition. Performs streaming repartitioning of the dataset (no shuffling). |
| 1802 | Note that either `num_blocks` or |
| 1803 | `target_num_rows_per_block` must be set, but not both. When |
| 1804 | `target_num_rows_per_block` is set, it only repartitions |
| 1805 | :class:`Dataset` :ref:`blocks <dataset_concept>` that are larger than |
| 1806 | `target_num_rows_per_block`. Note that the system will internally |
| 1807 | figure out the number of rows per :ref:`blocks <dataset_concept>` for |
| 1808 | optimal execution, based on the `target_num_rows_per_block`. This is |
| 1809 | the current behavior because of the implementation and may change in |
| 1810 | the future. |
| 1811 | strict: If ``True``, ``repartition`` guarantees that all output blocks, |
| 1812 | except for the last one, will have exactly ``target_num_rows_per_block`` rows. |
| 1813 | If ``False``, ``repartition`` uses best-effort bundling and may produce at most |