
Function read_bigquery

python/ray/data/read_api.py:1040–1124

Create a dataset from BigQuery. The data to read from is specified via the ``project_id``, ``dataset`` and/or ``query`` parameters. The dataset is created from the results of executing ``query`` if a query is provided. Otherwise, the entire ``dataset`` is read.

(
    project_id: str,
    dataset: Optional[str] = None,
    query: Optional[str] = None,
    *,
    parallelism: int = -1,
    num_cpus: Optional[float] = None,
    num_gpus: Optional[float] = None,
    memory: Optional[float] = None,
    ray_remote_args: Dict[str, Any] = None,
    concurrency: Optional[int] = None,
    override_num_blocks: Optional[int] = None,
)
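
The choice between `dataset` and `query` can be checked on the caller's side before any BigQuery request is made. The sketch below is a minimal illustration and not part of Ray: `build_read_bigquery_kwargs` and `read_table_or_query` are hypothetical helper names, and the validation rules come only from the docstring (the two arguments are mutually exclusive, and `dataset` uses the `dataset_id.table_id` format). Requiring at least one of the two is an assumption of this sketch. Ray is imported lazily so the validation can run without Ray installed or credentials configured.

```python
from typing import Any, Dict, Optional


def build_read_bigquery_kwargs(
    project_id: str,
    dataset: Optional[str] = None,
    query: Optional[str] = None,
    override_num_blocks: Optional[int] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: validate arguments and build kwargs for read_bigquery."""
    # The docstring states that `query` and `dataset` are mutually exclusive.
    if dataset is not None and query is not None:
        raise ValueError("Pass either `dataset` or `query`, not both.")
    # Assumption of this sketch: require one of the two so the read has a source.
    if dataset is None and query is None:
        raise ValueError("Pass one of `dataset` or `query`.")
    # The docstring gives the expected format as ``dataset_id.table_id``.
    if dataset is not None:
        parts = dataset.split(".")
        if len(parts) != 2 or not all(parts):
            raise ValueError("`dataset` must look like 'dataset_id.table_id'.")

    kwargs: Dict[str, Any] = {"project_id": project_id}
    if dataset is not None:
        kwargs["dataset"] = dataset
    if query is not None:
        kwargs["query"] = query
    if override_num_blocks is not None:
        kwargs["override_num_blocks"] = override_num_blocks
    return kwargs


def read_table_or_query(**kwargs: Any):
    """Hypothetical wrapper: needs Ray installed and Google Cloud credentials."""
    import ray  # imported lazily so the validation above works without Ray

    return ray.data.read_bigquery(**build_read_bigquery_kwargs(**kwargs))
```

With these helpers, a call such as `read_table_or_query(project_id="my_project", dataset="my_dataset.my_table")` would read a whole table, while passing `query=...` instead would read the query result. The project, dataset, and table names here are placeholders.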


@PublicAPI(stability="alpha")
def read_bigquery(
    project_id: str,
    dataset: Optional[str] = None,
    query: Optional[str] = None,
    *,
    parallelism: int = -1,
    num_cpus: Optional[float] = None,
    num_gpus: Optional[float] = None,
    memory: Optional[float] = None,
    ray_remote_args: Dict[str, Any] = None,
    concurrency: Optional[int] = None,
    override_num_blocks: Optional[int] = None,
) -> Dataset:
    """Create a dataset from BigQuery.

    The data to read from is specified via the ``project_id``, ``dataset``
    and/or ``query`` parameters. The dataset is created from the results of
    executing ``query`` if a query is provided. Otherwise, the entire
    ``dataset`` is read.

    For more information about BigQuery, see the following concepts:

    - Project id: `Creating and Managing Projects <https://cloud.google.com/resource-manager/docs/creating-managing-projects>`_

    - Dataset: `Datasets Intro <https://cloud.google.com/bigquery/docs/datasets-intro>`_

    - Query: `Query Syntax <https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax>`_

    This method uses the BigQuery Storage Read API which reads in parallel,
    with a Ray read task to handle each stream. The number of streams is
    determined by ``parallelism`` which can be requested from this interface
    or automatically chosen if unspecified (see the ``parallelism`` arg below).

    .. warning::
        The maximum query response size is 10GB.

    Examples:
        .. testcode::
            :skipif: True

            import ray
            # Users will need to authenticate beforehand (https://cloud.google.com/sdk/gcloud/reference/auth/login)
            ds = ray.data.read_bigquery(
                project_id="my_project",
                query="SELECT * FROM `bigquery-public-data.samples.gsod` LIMIT 1000",
            )

    Args:
        project_id: The name of the associated Google Cloud Project that hosts the dataset to read.
            For more information, see `Creating and Managing Projects <https://cloud.google.com/resource-manager/docs/creating-managing-projects>`_.
        dataset: The name of the dataset hosted in BigQuery in the format of ``dataset_id.table_id``.
            Both the dataset_id and table_id must exist otherwise an exception will be raised.
        query: The SQL query to execute. `query` and `dataset` are mutually exclusive.
            If `query` is provided, the query result is read as the dataset.
        parallelism: This argument is deprecated. Use ``override_num_blocks`` argument.
        num_cpus: The number of CPUs to reserve for each parallel read worker.
        num_gpus: The number of GPUs to reserve for each parallel read worker. For
            example, specify `num_gpus=1` to request 1 GPU for each parallel read
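
The docstring marks `parallelism` as deprecated in favor of `override_num_blocks`. For calling code that still passes the old argument, a caller-side shim along the following lines may ease the migration. This is a sketch only: `resolve_num_blocks` is a hypothetical name, it does not describe how Ray handles the argument internally, and it assumes that the signature default `-1` means "not specified".

```python
import warnings
from typing import Optional


def resolve_num_blocks(
    parallelism: int = -1,
    override_num_blocks: Optional[int] = None,
) -> Optional[int]:
    """Hypothetical shim: map a legacy `parallelism` value onto `override_num_blocks`."""
    # Assumption: -1 (the default in the signature) means the caller did not set it.
    if parallelism != -1:
        warnings.warn(
            "`parallelism` is deprecated; use `override_num_blocks` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        # An explicit `override_num_blocks` wins over the legacy value.
        if override_num_blocks is None:
            override_num_blocks = parallelism
    return override_num_blocks
```

The returned value can then be passed as `override_num_blocks=...` to `ray.data.read_bigquery`; a result of `None` leaves the number of blocks to be chosen automatically.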

Callers

nothing calls this directly

Calls 2

BigQueryDatasource (class) · 0.90
read_datasource (function) · 0.85

Tested by

no test coverage detected
