Writes a column of the :class:`~ray.data.Dataset` to .npy files. This is only supported for columns in the dataset that can be converted to NumPy arrays. The number of files is determined by the number of blocks in the dataset. To control the number of blocks, call :meth:`~ray.data.Dataset.repartition`.
(
    self,
    path: str,
    *,
    column: str,
    filesystem: Optional["pyarrow.fs.FileSystem"] = None,
    try_create_dir: bool = True,
    arrow_open_stream_args: Optional[Dict[str, Any]] = None,
    filename_provider: Optional[FilenameProvider] = None,
    min_rows_per_file: Optional[int] = None,
    ray_remote_args: Dict[str, Any] = None,
    concurrency: Optional[int] = None,
    num_rows_per_file: Optional[int] = None,
    mode: SaveMode = SaveMode.APPEND,
)
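The relationship between blocks and output files described above can be illustrated without Ray. The following is a minimal sketch in plain NumPy, not Ray's implementation: `write_blocks_as_npy` and `read_npy_dir` are hypothetical helpers, the `uuid4` value merely stands in for the per-dataset unique id, and the file name pattern follows the ``{uuid}_{block_idx}.npy`` default stated in the docstring.

```python
import tempfile
import uuid
from pathlib import Path

import numpy as np


def write_blocks_as_npy(column: np.ndarray, num_blocks: int, out_dir: Path) -> list:
    """Hypothetical helper: save one .npy file per block of `column`."""
    # Create missing directories, loosely analogous to try_create_dir=True.
    out_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: a random hex string stands in for the dataset's unique id.
    write_id = uuid.uuid4().hex
    paths = []
    # np.array_split tolerates lengths that do not divide evenly.
    for block_idx, block in enumerate(np.array_split(column, num_blocks)):
        path = out_dir / f"{write_id}_{block_idx}.npy"
        np.save(path, block)
        paths.append(path)
    return paths


def read_npy_dir(out_dir: Path) -> np.ndarray:
    """Load every .npy file in `out_dir` and concatenate the arrays."""
    # Files are read in name order, which need not match the original row
    # order once block indices reach two digits.
    files = sorted(out_dir.glob("*.npy"))
    return np.concatenate([np.load(f) for f in files])


with tempfile.TemporaryDirectory() as tmp:
    ids = np.arange(100)
    written = write_blocks_as_npy(ids, num_blocks=4, out_dir=Path(tmp))
    restored = read_npy_dir(Path(tmp))
    num_files = len(written)
```

In this sketch, changing `num_blocks` changes the number of files, which mirrors how calling `repartition` before `write_numpy` is described as controlling the file count. Each file can be loaded independently with `np.load`.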
@ConsumptionAPI
@PublicAPI(api_group=IOC_API_GROUP)
def write_numpy(
    self,
    path: str,
    *,
    column: str,
    filesystem: Optional["pyarrow.fs.FileSystem"] = None,
    try_create_dir: bool = True,
    arrow_open_stream_args: Optional[Dict[str, Any]] = None,
    filename_provider: Optional[FilenameProvider] = None,
    min_rows_per_file: Optional[int] = None,
    ray_remote_args: Dict[str, Any] = None,
    concurrency: Optional[int] = None,
    num_rows_per_file: Optional[int] = None,
    mode: SaveMode = SaveMode.APPEND,
) -> None:
    """Writes a column of the :class:`~ray.data.Dataset` to .npy files.

    This is only supported for columns in the dataset that can be converted to
    NumPy arrays.

    The number of files is determined by the number of blocks in the dataset.
    To control the number of blocks, call
    :meth:`~ray.data.Dataset.repartition`.

    By default, the format of the output files is ``{uuid}_{block_idx}.npy``,
    where ``uuid`` is a unique id for the dataset. To modify this behavior,
    implement a custom :class:`~ray.data.datasource.FilenameProvider`
    and pass it in as the ``filename_provider`` argument.

    Examples:
        >>> import ray
        >>> ds = ray.data.range(100)
        >>> ds.write_numpy("local:///tmp/data/", column="id")

    Time complexity: O(dataset size / parallelism)

    Args:
        path: The path to the destination root directory, where
            the npy files are written to.
        column: The name of the column that contains the data to
            be written.
        filesystem: The pyarrow filesystem implementation to write to.
            These filesystems are specified in the
            `pyarrow docs <https://arrow.apache.org/docs\
            /python/api/filesystems.html#filesystem-implementations>`_.
            Specify this if you need to provide specific configurations to the
            filesystem. By default, the filesystem is automatically selected based
            on the scheme of the paths. For example, if the path begins with
            ``s3://``, the ``S3FileSystem`` is used.
        try_create_dir: If ``True``, attempts to create all directories in
            destination path. Does nothing if all directories already
            exist. Defaults to ``True``.
        arrow_open_stream_args: kwargs passed to
            `pyarrow.fs.FileSystem.open_output_stream <https://arrow.apache.org\
            /docs/python/generated/pyarrow.fs.FileSystem.html\
            #pyarrow.fs.FileSystem.open_output_stream>`_, which is used when
            opening the file to write to.