hub / github.com/ray-project/ray / read_json

Function read_json

python/ray/data/read_api.py:1612–1782

Creates a :class:`~ray.data.Dataset` from JSON and JSONL files. For a JSON file, the whole file is read as one row. For a JSONL file, each line of the file is read as a separate row.
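The row semantics described above can be illustrated without Ray. The following is a minimal standard-library sketch, not Ray's reader: `rows_from_text` is a hypothetical helper, the sample records are made up, and it assumes each JSON document is a top-level object.

```python
import json
from typing import Any, Dict, List


def rows_from_text(text: str, lines: bool = False) -> List[Dict[str, Any]]:
    """Illustrative only: mimic the JSON vs. JSONL row semantics described above."""
    if lines:
        # JSONL: every non-empty line is an independent JSON document, one row each.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    # Plain JSON: the whole text is a single document, so it yields a single row.
    return [json.loads(text)]


# Hypothetical sample contents, chosen only for illustration.
json_text = '{"timestamp": "2024-01-01T00:00:00", "size": 42}'
jsonl_text = '{"input": "a"}\n{"input": "b"}\n{"input": "c"}\n'

json_rows = rows_from_text(json_text)
jsonl_rows = rows_from_text(jsonl_text, lines=True)
print(len(json_rows), len(jsonl_rows))
```

The single JSON document becomes one row, while the three-line JSONL text becomes three rows.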

(
    paths: Union[str, List[str]],
    *,
    lines: bool = False,
    filesystem: Optional["pyarrow.fs.FileSystem"] = None,
    parallelism: int = -1,
    num_cpus: Optional[float] = None,
    num_gpus: Optional[float] = None,
    memory: Optional[float] = None,
    ray_remote_args: Dict[str, Any] = None,
    arrow_open_stream_args: Optional[Dict[str, Any]] = None,
    partition_filter: Optional[PathPartitionFilter] = None,
    partitioning: Partitioning = Partitioning("hive"),
    include_paths: bool = False,
    ignore_missing_paths: bool = False,
    shuffle: Optional[Union[Literal["files"], FileShuffleConfig]] = None,
    file_extensions: Optional[List[str]] = JSON_FILE_EXTENSIONS,
    concurrency: Optional[int] = None,
    override_num_blocks: Optional[int] = None,
    **arrow_json_args,
)
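Everything after `paths` in this signature is keyword-only. As a hedged sketch of how a caller might bundle a few of these options, the helpers below (`jsonl_read_options` and `load_jsonl`, both hypothetical) use only parameter names that appear in the signature. The comments on what each flag does are inferred from the parameter names, and the extension format (no leading dot) is an assumption.

```python
from typing import Any, Dict, List, Optional


def jsonl_read_options(
    file_extensions: Optional[List[str]] = None,
    override_num_blocks: Optional[int] = None,
) -> Dict[str, Any]:
    """Collect keyword arguments for read_json (names taken from the signature)."""
    options: Dict[str, Any] = {
        "lines": True,                 # treat each line as a separate row
        "include_paths": True,         # assumed: record each row's source file path
        "ignore_missing_paths": True,  # assumed: skip paths that do not exist
    }
    if file_extensions is not None:
        options["file_extensions"] = file_extensions
    if override_num_blocks is not None:
        options["override_num_blocks"] = override_num_blocks
    return options


def load_jsonl(paths, **overrides):
    """Call ray.data.read_json with the bundled options plus any overrides."""
    import ray  # imported lazily so jsonl_read_options works without Ray installed

    return ray.data.read_json(paths, **{**jsonl_read_options(), **overrides})


opts = jsonl_read_options(file_extensions=["jsonl"], override_num_blocks=8)
print(sorted(opts))
```

Keeping the options in a plain dict makes them easy to inspect or log before any read is started.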

Source from the content-addressed store, hash-verified

@PublicAPI
def read_json(
    paths: Union[str, List[str]],
    *,
    lines: bool = False,
    filesystem: Optional["pyarrow.fs.FileSystem"] = None,
    parallelism: int = -1,
    num_cpus: Optional[float] = None,
    num_gpus: Optional[float] = None,
    memory: Optional[float] = None,
    ray_remote_args: Dict[str, Any] = None,
    arrow_open_stream_args: Optional[Dict[str, Any]] = None,
    partition_filter: Optional[PathPartitionFilter] = None,
    partitioning: Partitioning = Partitioning("hive"),
    include_paths: bool = False,
    ignore_missing_paths: bool = False,
    shuffle: Optional[Union[Literal["files"], FileShuffleConfig]] = None,
    file_extensions: Optional[List[str]] = JSON_FILE_EXTENSIONS,
    concurrency: Optional[int] = None,
    override_num_blocks: Optional[int] = None,
    **arrow_json_args,
) -> Dataset:
    """Creates a :class:`~ray.data.Dataset` from JSON and JSONL files.

    For JSON file, the whole file is read as one row.
    For JSONL file, each line of file is read as separate row.

    Examples:
        Read a JSON file in remote storage.

        >>> import ray
        >>> ds = ray.data.read_json("s3://anonymous@ray-example-data/log.json")
        >>> ds.schema()
        Column     Type
        ------     ----
        timestamp  timestamp[...]
        size       int64

        Read a JSONL file in remote storage.

        >>> ds = ray.data.read_json("s3://anonymous@ray-example-data/train.jsonl", lines=True)
        >>> ds.schema()
        Column  Type
        ------  ----
        input   <class 'object'>

        Read multiple local files.

        >>> ray.data.read_json( # doctest: +SKIP
        ...    ["local:///path/to/file1", "local:///path/to/file2"])

        Read multiple directories.

        >>> ray.data.read_json( # doctest: +SKIP
        ...     ["s3://bucket/path1", "s3://bucket/path2"])

        By default, :meth:`~ray.data.read_json` parses
        `Hive-style partitions <https://athena.guide/articles/\
        hive-style-partitioning/>`_
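The docstring excerpt ends before it says what Ray does with the parsed partition values. The sketch below only illustrates the Hive-style path convention itself, where directory names take the form `key=value`. It is a standard-library illustration, not Ray's `Partitioning` implementation; `parse_hive_partitions` and the example path are hypothetical.

```python
from typing import Dict


def parse_hive_partitions(path: str) -> Dict[str, str]:
    """Extract key=value directory segments from a path (illustrative only)."""
    partitions: Dict[str, str] = {}
    # Drop the final segment (the file name); only directories carry partition keys.
    for segment in path.split("/")[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


# Hypothetical path, used only to show the naming convention.
example = "s3://bucket/logs/year=2024/month=01/part-0.jsonl"
print(parse_hive_partitions(example))
```

Values come back as strings here; any type conversion is left out of the sketch.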

Callers 1

setUpClass (Method) · 0.90

Calls 7

Partitioning (Class) · 0.90

DefaultFileMetadataProvider (Class) · 0.90

PandasJSONDatasource (Class) · 0.90

ArrowJSONDatasource (Class) · 0.90

read_datasource (Function) · 0.85

items (Method) · 0.45

get_current (Method) · 0.45

Tested by 1

setUpClass (Method) · 0.72
