hub / github.com/ray-project/ray / write_numpy

Method write_numpy

python/ray/data/dataset.py:5094–5191

Writes a column of the :class:`~ray.data.Dataset` to .npy files. This is only supported for columns in the dataset that can be converted to NumPy arrays. The number of files is determined by the number of blocks in the dataset. To control the number of blocks, call :meth:`~ray.data.Dataset.repartition`.
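The relationship between blocks and output files can be pictured with a toy model. The sketch below is not Ray's implementation: `split_into_blocks` is a hypothetical helper that assumes rows are divided into even, contiguous ranges, which Ray's repartitioning does not guarantee. It only illustrates that one file is written per block.

```python
from typing import List


def split_into_blocks(num_rows: int, num_blocks: int) -> List[range]:
    """Split row indices 0..num_rows-1 into num_blocks contiguous ranges.

    Hypothetical helper for illustration only; Ray may distribute rows
    across blocks differently.
    """
    if num_blocks <= 0:
        raise ValueError("num_blocks must be positive")
    base, extra = divmod(num_rows, num_blocks)
    blocks = []
    start = 0
    for i in range(num_blocks):
        # The first `extra` blocks each take one additional row.
        size = base + (1 if i < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


# 100 rows split into 4 blocks: one .npy file per block, so 4 files.
blocks = split_into_blocks(100, 4)
num_files = len(blocks)
```

With an actual dataset, the equivalent step would be calling `ds.repartition(4)` before `write_numpy`, so that the write produces one file for each of the four blocks.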

(
        self,
        path: str,
        *,
        column: str,
        filesystem: Optional["pyarrow.fs.FileSystem"] = None,
        try_create_dir: bool = True,
        arrow_open_stream_args: Optional[Dict[str, Any]] = None,
        filename_provider: Optional[FilenameProvider] = None,
        min_rows_per_file: Optional[int] = None,
        ray_remote_args: Dict[str, Any] = None,
        concurrency: Optional[int] = None,
        num_rows_per_file: Optional[int] = None,
        mode: SaveMode = SaveMode.APPEND,
    )

Source

@ConsumptionAPI
@PublicAPI(api_group=IOC_API_GROUP)
def write_numpy(
    self,
    path: str,
    *,
    column: str,
    filesystem: Optional["pyarrow.fs.FileSystem"] = None,
    try_create_dir: bool = True,
    arrow_open_stream_args: Optional[Dict[str, Any]] = None,
    filename_provider: Optional[FilenameProvider] = None,
    min_rows_per_file: Optional[int] = None,
    ray_remote_args: Dict[str, Any] = None,
    concurrency: Optional[int] = None,
    num_rows_per_file: Optional[int] = None,
    mode: SaveMode = SaveMode.APPEND,
) -> None:
    """Writes a column of the :class:`~ray.data.Dataset` to .npy files.

    This is only supported for columns in the dataset that can be converted to
    NumPy arrays.

    The number of files is determined by the number of blocks in the dataset.
    To control the number of blocks, call
    :meth:`~ray.data.Dataset.repartition`.


    By default, the format of the output files is ``{uuid}_{block_idx}.npy``,
    where ``uuid`` is a unique id for the dataset. To modify this behavior,
    implement a custom :class:`~ray.data.datasource.FilenameProvider`
    and pass it in as the ``filename_provider`` argument.

    Examples:
        >>> import ray
        >>> ds = ray.data.range(100)
        >>> ds.write_numpy("local:///tmp/data/", column="id")

    Time complexity: O(dataset size / parallelism)

    Args:
        path: The path to the destination root directory, where
            the npy files are written to.
        column: The name of the column that contains the data to
            be written.
        filesystem: The pyarrow filesystem implementation to write to.
            These filesystems are specified in the
            `pyarrow docs <https://arrow.apache.org/docs\
            /python/api/filesystems.html#filesystem-implementations>`_.
            Specify this if you need to provide specific configurations to the
            filesystem. By default, the filesystem is automatically selected based
            on the scheme of the paths. For example, if the path begins with
            ``s3://``, the ``S3FileSystem`` is used.
        try_create_dir: If ``True``, attempts to create all directories in
            destination path. Does nothing if all directories already
            exist. Defaults to ``True``.
        arrow_open_stream_args: kwargs passed to
            `pyarrow.fs.FileSystem.open_output_stream <https://arrow.apache.org\
            /docs/python/generated/pyarrow.fs.FileSystem.html\
            #pyarrow.fs.FileSystem.open_output_stream>`_, which is used when
            opening the file to write to.
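The default naming scheme described in the docstring, ``{uuid}_{block_idx}.npy``, can be sketched in a few lines. This is an illustration under stated assumptions, not the library's code: `default_npy_filename` is a hypothetical helper, the id is generated here with Python's `uuid` module, and the real implementation may format the id or the index differently (for example with zero padding).

```python
import uuid


def default_npy_filename(dataset_uuid: str, block_idx: int) -> str:
    """Build a name following the documented ``{uuid}_{block_idx}.npy`` pattern.

    Hypothetical helper; the exact formatting used by Ray is an assumption.
    """
    return f"{dataset_uuid}_{block_idx}.npy"


# One id shared by every file written for the same dataset (assumed here
# to be a random hex string), combined with the index of each block.
dataset_uuid = uuid.uuid4().hex
filenames = [default_npy_filename(dataset_uuid, i) for i in range(3)]
```

Because every file shares the same id prefix, the outputs of one write can be told apart from those of another write to the same directory. A custom `FilenameProvider` passed as `filename_provider` replaces this scheme entirely.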

Callers 2

test_numpy_roundtrip · Function · 0.80

test_numpy_write · Function · 0.80

Calls 3

write_datasink · Method · 0.95

_validate_rows_per_file_args · Function · 0.90

NumpyDatasink · Class · 0.90

Tested by 2

test_numpy_roundtrip · Function · 0.64

test_numpy_write · Function · 0.64