hub / github.com/ray-project/ray / read_json

Function read_json

python/ray/data/read_api.py:1612–1782

Creates a :class:`~ray.data.Dataset` from JSON and JSONL files. For a JSON file, the whole file is read as one row. For a JSONL file, each line of the file is read as a separate row.
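The row semantics described above can be illustrated with a small standard-library sketch. It is not Ray's implementation (the source below hands the work to `ArrowJSONDatasource`); the helper names `rows_from_json` and `rows_from_jsonl` are hypothetical, and skipping blank lines is an assumption made here for convenience.

```python
import json
from typing import Any, List


def rows_from_json(text: str) -> List[Any]:
    """Hypothetical helper: treat the whole JSON document as a single row."""
    return [json.loads(text)]


def rows_from_jsonl(text: str) -> List[Any]:
    """Hypothetical helper: treat each non-empty line as its own row.

    Skipping blank lines is an assumption of this sketch, not documented
    behavior of ray.data.read_json.
    """
    return [json.loads(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    json_text = '{"timestamp": "2024-01-01T00:00:00", "size": 42}'
    jsonl_text = '{"input": "a"}\n{"input": "b"}\n{"input": "c"}\n'

    print(len(rows_from_json(json_text)))    # one row for the whole file
    print(len(rows_from_jsonl(jsonl_text)))  # one row per line
```

In terms of the signature below, the second helper corresponds roughly to what the docstring describes for JSONL input, where the example passes `lines=True`.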

(
    paths: Union[str, List[str]],
    *,
    lines: bool = False,
    filesystem: Optional["pyarrow.fs.FileSystem"] = None,
    parallelism: int = -1,
    num_cpus: Optional[float] = None,
    num_gpus: Optional[float] = None,
    memory: Optional[float] = None,
    ray_remote_args: Dict[str, Any] = None,
    arrow_open_stream_args: Optional[Dict[str, Any]] = None,
    partition_filter: Optional[PathPartitionFilter] = None,
    partitioning: Partitioning = Partitioning("hive"),
    include_paths: bool = False,
    ignore_missing_paths: bool = False,
    shuffle: Optional[Union[Literal["files"], FileShuffleConfig]] = None,
    file_extensions: Optional[List[str]] = JSON_FILE_EXTENSIONS,
    concurrency: Optional[int] = None,
    override_num_blocks: Optional[int] = None,
    **arrow_json_args,
)

Source from the content-addressed store, hash-verified

@PublicAPI
def read_json(
    paths: Union[str, List[str]],
    *,
    lines: bool = False,
    filesystem: Optional["pyarrow.fs.FileSystem"] = None,
    parallelism: int = -1,
    num_cpus: Optional[float] = None,
    num_gpus: Optional[float] = None,
    memory: Optional[float] = None,
    ray_remote_args: Dict[str, Any] = None,
    arrow_open_stream_args: Optional[Dict[str, Any]] = None,
    partition_filter: Optional[PathPartitionFilter] = None,
    partitioning: Partitioning = Partitioning("hive"),
    include_paths: bool = False,
    ignore_missing_paths: bool = False,
    shuffle: Optional[Union[Literal["files"], FileShuffleConfig]] = None,
    file_extensions: Optional[List[str]] = JSON_FILE_EXTENSIONS,
    concurrency: Optional[int] = None,
    override_num_blocks: Optional[int] = None,
    **arrow_json_args,
) -> Dataset:
    """Creates a :class:`~ray.data.Dataset` from JSON and JSONL files.

    For JSON file, the whole file is read as one row.
    For JSONL file, each line of file is read as separate row.

    Examples:
        Read a JSON file in remote storage.

        >>> import ray
        >>> ds = ray.data.read_json("s3://anonymous@ray-example-data/log.json")
        >>> ds.schema()
        Column     Type
        ------     ----
        timestamp  timestamp[...]
        size       int64

        Read a JSONL file in remote storage.

        >>> ds = ray.data.read_json("s3://anonymous@ray-example-data/train.jsonl", lines=True)
        >>> ds.schema()
        Column  Type
        ------  ----
        input   <class 'object'>

        Read multiple local files.

        >>> ray.data.read_json( # doctest: +SKIP
        ...     ["local:///path/to/file1", "local:///path/to/file2"])

        Read multiple directories.

        >>> ray.data.read_json( # doctest: +SKIP
        ...     ["s3://bucket/path1", "s3://bucket/path2"])

        By default, :meth:`~ray.data.read_json` parses
        `Hive-style partitions <https://athena.guide/articles/\
        hive-style-partitioning/>`_
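The excerpt ends where the docstring mentions Hive-style partitions, which is also the default for the `partitioning` parameter (`Partitioning("hive")`). As a rough illustration of that path convention only, the sketch below pulls `key=value` directory segments out of a file path. The function name `parse_hive_partitions` and the example paths are hypothetical, and this is not how Ray's `Partitioning` class is implemented.

```python
from typing import Dict


def parse_hive_partitions(path: str) -> Dict[str, str]:
    """Hypothetical helper: collect key=value directory segments from a path.

    Only directory components are inspected; the final component is assumed
    to be the file name. Values are kept as strings in this sketch.
    """
    partitions: Dict[str, str] = {}
    directories = path.split("/")[:-1]  # drop the file name
    for segment in directories:
        key, sep, value = segment.partition("=")
        if sep and key:  # keep only segments shaped like key=value
            partitions[key] = value
    return partitions


if __name__ == "__main__":
    # Hypothetical layout: data/year=2024/month=01/part-0.json
    print(parse_hive_partitions("data/year=2024/month=01/part-0.json"))
    # A path with no key=value directories yields no partition fields.
    print(parse_hive_partitions("data/part-0.json"))
```

Under this convention, each `key=value` directory contributes one partition field for every file beneath it.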

Callers 1

setUpClass (Method) · 0.90

Calls 7

Partitioning (Class) · 0.90
ArrowJSONDatasource (Class) · 0.90
read_datasource (Function) · 0.85
items (Method) · 0.45
get_current (Method) · 0.45

Tested by 1

setUpClass (Method) · 0.72
