hub / github.com/ray-project/ray / read_bigquery

Function read_bigquery

python/ray/data/read_api.py:1040–1124

Create a dataset from BigQuery. The data to read is specified via the ``project_id``, ``dataset`` and/or ``query`` parameters. If ``query`` is provided, the dataset is created from the results of executing it; otherwise, the entire ``dataset`` is read.

(
    project_id: str,
    dataset: Optional[str] = None,
    query: Optional[str] = None,
    *,
    parallelism: int = -1,
    num_cpus: Optional[float] = None,
    num_gpus: Optional[float] = None,
    memory: Optional[float] = None,
    ray_remote_args: Dict[str, Any] = None,
    concurrency: Optional[int] = None,
    override_num_blocks: Optional[int] = None,
)
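
The two read modes described above (a whole table via ``dataset``, or a result set via ``query``) can be sketched as follows. This is a minimal illustration, not part of Ray: ``build_read_kwargs`` and ``load`` are hypothetical helpers, ``my_dataset.my_table`` is a placeholder table name, and rejecting the case where neither argument is given is a choice made by this sketch rather than documented behavior. Calling ``load`` would additionally need Ray Data, the BigQuery client libraries, and Google Cloud credentials.

```python
from typing import Any, Dict, Optional


def build_read_kwargs(
    project_id: str,
    dataset: Optional[str] = None,
    query: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: select exactly one of the two read modes."""
    if (dataset is None) == (query is None):
        # The docstring states that `query` and `dataset` are mutually
        # exclusive; rejecting "neither" is an assumption of this sketch.
        raise ValueError("Pass exactly one of `dataset` or `query`.")
    kwargs: Dict[str, Any] = {"project_id": project_id}
    if query is not None:
        kwargs["query"] = query  # read the result of the SQL query
    else:
        kwargs["dataset"] = dataset  # read the entire table
    return kwargs


def load(**kwargs: Any):
    """Run the read; not executed here because it needs GCP access."""
    import ray  # imported lazily so the helper above has no dependencies

    return ray.data.read_bigquery(**kwargs)


# Placeholder project and table names, for illustration only.
table_kwargs = build_read_kwargs("my_project", dataset="my_dataset.my_table")
query_kwargs = build_read_kwargs(
    "my_project",
    query="SELECT * FROM `bigquery-public-data.samples.gsod` LIMIT 1000",
)
```

With credentials in place, ``load(**query_kwargs, override_num_blocks=8)`` would forward the arguments to ``ray.data.read_bigquery``; the value 8 is arbitrary, and ``override_num_blocks`` is used because the docstring marks ``parallelism`` as deprecated.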

Source from the content-addressed store, hash-verified

@PublicAPI(stability="alpha")
def read_bigquery(
    project_id: str,
    dataset: Optional[str] = None,
    query: Optional[str] = None,
    *,
    parallelism: int = -1,
    num_cpus: Optional[float] = None,
    num_gpus: Optional[float] = None,
    memory: Optional[float] = None,
    ray_remote_args: Dict[str, Any] = None,
    concurrency: Optional[int] = None,
    override_num_blocks: Optional[int] = None,
) -> Dataset:
    """Create a dataset from BigQuery.

    The data to read from is specified via the ``project_id``, ``dataset``
    and/or ``query`` parameters. The dataset is created from the results of
    executing ``query`` if a query is provided. Otherwise, the entire
    ``dataset`` is read.

    For more information about BigQuery, see the following concepts:

    - Project id: `Creating and Managing Projects <https://cloud.google.com/resource-manager/docs/creating-managing-projects>`_

    - Dataset: `Datasets Intro <https://cloud.google.com/bigquery/docs/datasets-intro>`_

    - Query: `Query Syntax <https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax>`_

    This method uses the BigQuery Storage Read API which reads in parallel,
    with a Ray read task to handle each stream. The number of streams is
    determined by ``parallelism`` which can be requested from this interface
    or automatically chosen if unspecified (see the ``parallelism`` arg below).

    .. warning::
        The maximum query response size is 10GB.

    Examples:
        .. testcode::
            :skipif: True

            import ray
            # Users will need to authenticate beforehand (https://cloud.google.com/sdk/gcloud/reference/auth/login)
            ds = ray.data.read_bigquery(
                project_id="my_project",
                query="SELECT * FROM `bigquery-public-data.samples.gsod` LIMIT 1000",
            )

    Args:
        project_id: The name of the associated Google Cloud Project that hosts the dataset to read.
            For more information, see `Creating and Managing Projects <https://cloud.google.com/resource-manager/docs/creating-managing-projects>`_.
        dataset: The name of the dataset hosted in BigQuery in the format of ``dataset_id.table_id``.
            Both the dataset_id and table_id must exist otherwise an exception will be raised.
        query: The SQL query to execute. `query` and `dataset` are mutually exclusive.
            If `query` is provided, the query result is read as the dataset.
        parallelism: This argument is deprecated. Use ``override_num_blocks`` argument.
        num_cpus: The number of CPUs to reserve for each parallel read worker.
        num_gpus: The number of GPUs to reserve for each parallel read worker. For
            example, specify `num_gpus=1` to request 1 GPU for each parallel read
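
The ``dataset`` argument is documented above as a ``dataset_id.table_id`` string. A local shape check before submitting a read might look like the sketch below. ``split_table_ref`` is a hypothetical helper, not a Ray or BigQuery function, and it only checks the format of the string; whether the dataset and table actually exist is still determined by BigQuery at read time.

```python
from typing import Tuple


def split_table_ref(dataset: str) -> Tuple[str, str]:
    """Hypothetical helper: split ``dataset_id.table_id`` into its two parts."""
    parts = dataset.split(".")
    # Require exactly two non-empty components, as the docstring describes.
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"Expected 'dataset_id.table_id', got {dataset!r}")
    return parts[0], parts[1]


# Placeholder names, for illustration only.
dataset_id, table_id = split_table_ref("my_dataset.my_table")
```

Failing fast on a malformed string avoids launching read tasks that would only fail once they reach BigQuery.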

Callers

nothing calls this directly

Calls 2

BigQueryDatasource (Class) · 0.90

read_datasource (Function) · 0.85

Tested by

no test coverage detected
