Randomly shuffle the rows of this :class:`Dataset`.

.. tip::

    This method can be slow. For better performance, try
    :ref:`Iterating over batches with shuffling <iterating-over-batches-with-shuffling>`.
    Also, see :ref:`Optimizing shuffles <optimizing_shuffles>`.
(
self,
*,
seed: Optional[int | RandomSeedConfig] = None,
num_blocks: Optional[int] = None,
**ray_remote_args,
)
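The ``seed`` parameter accepts either an integer or a ``RandomSeedConfig``. The sketch below is a toy model of the two seeding modes, written with Python's ``random.Random`` rather than Ray. The helper ``toy_shuffle``, its ``execution_idx`` argument, and the ``seed + execution_idx`` derivation are all hypothetical and chosen only for illustration; they are not taken from Ray's implementation.

```python
import random
from typing import List, Optional


def toy_shuffle(
    rows: List[int],
    seed: Optional[int],
    reseed_after_execution: bool = False,
    execution_idx: int = 0,
) -> List[int]:
    """Toy stand-in for a seeded shuffle; not how Ray implements it."""
    if seed is None:
        rng = random.Random()  # unseeded: order is not reproducible
    elif reseed_after_execution:
        # Hypothetical derivation: mix the execution index into the seed.
        rng = random.Random(seed + execution_idx)
    else:
        rng = random.Random(seed)  # same seed, same order, every execution
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled


rows = list(range(100))

# Fixed seed: every execution yields the same order.
fixed_a = toy_shuffle(rows, seed=42)
fixed_b = toy_shuffle(rows, seed=42)

# Reseeding: each execution (for example, each epoch) gets its own order,
# but replaying the same execution index reproduces that order.
epoch_0 = toy_shuffle(rows, seed=42, reseed_after_execution=True, execution_idx=0)
epoch_1 = toy_shuffle(rows, seed=42, reseed_after_execution=True, execution_idx=1)
epoch_1_again = toy_shuffle(rows, seed=42, reseed_after_execution=True, execution_idx=1)
```

In this model a plain integer seed behaves like the ``reseed_after_execution=False`` case, which matches the documented default for integer seeds.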
@AllToAllAPI
@PublicAPI(api_group=SSR_API_GROUP)
def random_shuffle(
    self,
    *,
    seed: Optional[int | RandomSeedConfig] = None,
    num_blocks: Optional[int] = None,
    **ray_remote_args,
) -> "Dataset":
    """Randomly shuffle the rows of this :class:`Dataset`.

    .. tip::

        This method can be slow. For better performance, try
        :ref:`Iterating over batches with shuffling <iterating-over-batches-with-shuffling>`.
        Also, see :ref:`Optimizing shuffles <optimizing_shuffles>`.

    Examples:
        >>> import ray
        >>> from ray.data import RandomSeedConfig
        >>> ds = ray.data.range(100)
        >>> ds.random_shuffle().take(3)  # doctest: +SKIP
        [{'id': 41}, {'id': 21}, {'id': 92}]
        >>> ds.random_shuffle(seed=42).take(3)  # doctest: +SKIP
        [{'id': 24}, {'id': 97}, {'id': 17}]

        Fully deterministic across executions:
        >>> ds = ray.data.range(100)
        >>> ds.random_shuffle(seed=RandomSeedConfig(seed=42, reseed_after_execution=False)).take(3)  # doctest: +SKIP
        [{'id': 24}, {'id': 97}, {'id': 17}]
        >>> ds.random_shuffle(seed=RandomSeedConfig(seed=42, reseed_after_execution=False)).take(3)  # doctest: +SKIP
        [{'id': 24}, {'id': 97}, {'id': 17}]

        Reproducible but non-deterministic across executions (e.g., training epochs):
        >>> ds = ray.data.range(100)
        >>> ds.random_shuffle(seed=RandomSeedConfig(seed=42, reseed_after_execution=True)).take(3)  # doctest: +SKIP
        [{'id': 29}, {'id': 79}, {'id': 39}]
        >>> ds.random_shuffle(seed=RandomSeedConfig(seed=42, reseed_after_execution=True)).take(3)  # doctest: +SKIP
        [{'id': 40}, {'id': 7}, {'id': 90}]

    Time complexity: O(dataset size / parallelism)

    Args:
        seed: An optional random seed. Can be an integer or a :class:`RandomSeedConfig`
            object. If an integer is provided, it defaults to fully deterministic
            behavior (same shuffle order across executions). If None, the shuffle
            is non-deterministic. See :class:`RandomSeedConfig` for more details on seed behavior.
        num_blocks: This parameter is deprecated. It was previously intended to
            specify the number of output blocks in the shuffled dataset, but is no
            longer supported. To control the number of output blocks, use
            :meth:`Dataset.repartition` after shuffling instead.
        **ray_remote_args: Additional resource requirements to request from
            Ray (e.g., num_gpus=1 to request GPUs for the map tasks). See
            :func:`ray.remote` for details.

    Returns:
        The shuffled :class:`Dataset`.
    """  # noqa: E501

    if num_blocks is not None: