Creates a :class:`~ray.data.Dataset` from JSON and JSONL files. For a JSON file, the whole file is read as one row. For a JSONL file, each line of the file is read as a separate row.

    ray.data.read_json(
        paths: Union[str, List[str]],
        *,
        lines: bool = False,
        filesystem: Optional["pyarrow.fs.FileSystem"] = None,
        parallelism: int = -1,
        num_cpus: Optional[float] = None,
        num_gpus: Optional[float] = None,
        memory: Optional[float] = None,
        ray_remote_args: Dict[str, Any] = None,
        arrow_open_stream_args: Optional[Dict[str, Any]] = None,
        partition_filter: Optional[PathPartitionFilter] = None,
        partitioning: Partitioning = Partitioning("hive"),
        include_paths: bool = False,
        ignore_missing_paths: bool = False,
        shuffle: Optional[Union[Literal["files"], FileShuffleConfig]] = None,
        file_extensions: Optional[List[str]] = JSON_FILE_EXTENSIONS,
        concurrency: Optional[int] = None,
        override_num_blocks: Optional[int] = None,
        **arrow_json_args,
    )
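The difference between the two file formats can be illustrated without Ray. The following is a minimal sketch that uses only the Python standard library; the helper names `write_sample_files`, `parse_json`, and `parse_jsonl` and the sample records are hypothetical, and the parsing shown illustrates the formats themselves, not how Ray reads them internally.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, List, Tuple


def write_sample_files(directory: Path) -> Tuple[Path, Path]:
    """Write one JSON file and one JSONL file with made-up sample records."""
    json_path = directory / "log.json"
    jsonl_path = directory / "train.jsonl"
    # A JSON file holds a single document.
    json_path.write_text(json.dumps({"timestamp": "2024-01-01T00:00:00", "size": 42}))
    # A JSONL file holds one JSON document per line.
    records = [{"input": "a"}, {"input": "b"}, {"input": "c"}]
    jsonl_path.write_text("\n".join(json.dumps(r) for r in records) + "\n")
    return json_path, jsonl_path


def parse_json(path: Path) -> List[Any]:
    """Treat the whole file as one document, which yields one record."""
    return [json.loads(path.read_text())]


def parse_jsonl(path: Path) -> List[Any]:
    """Treat each non-empty line as its own record."""
    return [json.loads(line) for line in path.read_text().splitlines() if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    json_path, jsonl_path = write_sample_files(Path(tmp))
    json_rows = parse_json(json_path)
    jsonl_rows = parse_jsonl(jsonl_path)

print(len(json_rows), len(jsonl_rows))  # prints: 1 3
```

With Ray installed, the analogous reads would presumably be `ray.data.read_json(path)` for the first file and `ray.data.read_json(path, lines=True)` for the second, going by the `lines` parameter in the signature.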
    @PublicAPI
    def read_json(
        paths: Union[str, List[str]],
        *,
        lines: bool = False,
        filesystem: Optional["pyarrow.fs.FileSystem"] = None,
        parallelism: int = -1,
        num_cpus: Optional[float] = None,
        num_gpus: Optional[float] = None,
        memory: Optional[float] = None,
        ray_remote_args: Dict[str, Any] = None,
        arrow_open_stream_args: Optional[Dict[str, Any]] = None,
        partition_filter: Optional[PathPartitionFilter] = None,
        partitioning: Partitioning = Partitioning("hive"),
        include_paths: bool = False,
        ignore_missing_paths: bool = False,
        shuffle: Optional[Union[Literal["files"], FileShuffleConfig]] = None,
        file_extensions: Optional[List[str]] = JSON_FILE_EXTENSIONS,
        concurrency: Optional[int] = None,
        override_num_blocks: Optional[int] = None,
        **arrow_json_args,
    ) -> Dataset:
        """Creates a :class:`~ray.data.Dataset` from JSON and JSONL files.

        For a JSON file, the whole file is read as one row.
        For a JSONL file, each line of the file is read as a separate row.

        Examples:
            Read a JSON file in remote storage.

            >>> import ray
            >>> ds = ray.data.read_json("s3://anonymous@ray-example-data/log.json")
            >>> ds.schema()
            Column     Type
            ------     ----
            timestamp  timestamp[...]
            size       int64

            Read a JSONL file in remote storage.

            >>> ds = ray.data.read_json("s3://anonymous@ray-example-data/train.jsonl", lines=True)
            >>> ds.schema()
            Column  Type
            ------  ----
            input   <class 'object'>

            Read multiple local files.

            >>> ray.data.read_json( # doctest: +SKIP
            ...    ["local:///path/to/file1", "local:///path/to/file2"])

            Read multiple directories.

            >>> ray.data.read_json( # doctest: +SKIP
            ...     ["s3://bucket/path1", "s3://bucket/path2"])

            By default, :meth:`~ray.data.read_json` parses
            `Hive-style partitions <https://athena.guide/articles/\
    hive-style-partitioning/>`_
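Hive-style partitioning encodes column values as `key=value` directory names inside a file path. As a rough illustration of that naming convention, the sketch below extracts those pairs using only the standard library. The helper `parse_hive_partitions` and the sample path are hypothetical, and this is not a reproduction of how Ray's `Partitioning("hive")` is implemented.

```python
from pathlib import PurePosixPath
from typing import Dict


def parse_hive_partitions(path: str) -> Dict[str, str]:
    """Collect key=value directory segments from a path, ignoring the file name."""
    partitions: Dict[str, str] = {}
    for segment in PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        # Only segments of the form key=value count as partition fields.
        if sep and key:
            partitions[key] = value
    return partitions


example = parse_hive_partitions("bucket/year=2024/month=01/data.json")
print(example)  # prints: {'year': '2024', 'month': '01'}
```

A path with no `key=value` directories yields an empty mapping under this sketch, so unpartitioned data contributes no partition fields.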