Create a dataset from BigQuery. The data to read from is specified via the ``project_id``, ``dataset`` and/or ``query`` parameters. The dataset is created from the results of executing ``query`` if a query is provided. Otherwise, the entire ``dataset`` is read. For more informa
(
project_id: str,
dataset: Optional[str] = None,
query: Optional[str] = None,
*,
parallelism: int = -1,
num_cpus: Optional[float] = None,
num_gpus: Optional[float] = None,
memory: Optional[float] = None,
ray_remote_args: Dict[str, Any] = None,
concurrency: Optional[int] = None,
override_num_blocks: Optional[int] = None,
)
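The description above says that ``query`` and ``dataset`` are alternative ways to select the data. A minimal sketch of checking the arguments before calling ``ray.data.read_bigquery`` might look like the following. The helper names ``build_read_bigquery_kwargs`` and ``read_with_kwargs`` are hypothetical and not part of Ray. Requiring at least one of the two arguments, and checking the ``dataset_id.table_id`` shape on the client side, are assumptions of this sketch rather than documented library behavior.

```python
from typing import Any, Dict, Optional


def build_read_bigquery_kwargs(
    project_id: str,
    dataset: Optional[str] = None,
    query: Optional[str] = None,
    override_num_blocks: Optional[int] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: validate arguments and build keyword arguments."""
    # `query` and `dataset` are described as mutually exclusive.
    if dataset is not None and query is not None:
        raise ValueError("Pass either `dataset` or `query`, not both.")
    # Assumption of this sketch: at least one of the two must be given.
    if dataset is None and query is None:
        raise ValueError("Pass one of `dataset` or `query`.")
    # `dataset` is described as having the form `dataset_id.table_id`.
    if dataset is not None:
        parts = dataset.split(".")
        if len(parts) != 2 or not all(parts):
            raise ValueError("`dataset` must look like 'dataset_id.table_id'.")

    kwargs: Dict[str, Any] = {"project_id": project_id}
    if dataset is not None:
        kwargs["dataset"] = dataset
    if query is not None:
        kwargs["query"] = query
    # `parallelism` is marked deprecated in favor of `override_num_blocks`,
    # so only the latter is forwarded here.
    if override_num_blocks is not None:
        kwargs["override_num_blocks"] = override_num_blocks
    return kwargs


def read_with_kwargs(kwargs: Dict[str, Any]):
    """Hypothetical wrapper: needs Ray installed and Google Cloud credentials."""
    import ray  # imported lazily so the validation above works without Ray

    return ray.data.read_bigquery(**kwargs)
```

With credentials in place, ``read_with_kwargs(build_read_bigquery_kwargs("my_project", dataset="my_dataset.my_table"))`` would read a whole table. The project, dataset, and table names here are placeholders.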
@PublicAPI(stability="alpha")
def read_bigquery(
    project_id: str,
    dataset: Optional[str] = None,
    query: Optional[str] = None,
    *,
    parallelism: int = -1,
    num_cpus: Optional[float] = None,
    num_gpus: Optional[float] = None,
    memory: Optional[float] = None,
    ray_remote_args: Dict[str, Any] = None,
    concurrency: Optional[int] = None,
    override_num_blocks: Optional[int] = None,
) -> Dataset:
    """Create a dataset from BigQuery.

    The data to read from is specified via the ``project_id``, ``dataset``
    and/or ``query`` parameters. The dataset is created from the results of
    executing ``query`` if a query is provided. Otherwise, the entire
    ``dataset`` is read.

    For more information about BigQuery, see the following concepts:

    - Project id: `Creating and Managing Projects <https://cloud.google.com/resource-manager/docs/creating-managing-projects>`_

    - Dataset: `Datasets Intro <https://cloud.google.com/bigquery/docs/datasets-intro>`_

    - Query: `Query Syntax <https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax>`_

    This method uses the BigQuery Storage Read API, which reads in parallel,
    with a Ray read task to handle each stream. The number of streams is
    determined by ``parallelism``, which can be requested from this interface
    or automatically chosen if unspecified (see the ``parallelism`` arg below).

    .. warning::
        The maximum query response size is 10GB.

    Examples:
        .. testcode::
            :skipif: True

            import ray
            # Users need to authenticate beforehand (https://cloud.google.com/sdk/gcloud/reference/auth/login)
            ds = ray.data.read_bigquery(
                project_id="my_project",
                query="SELECT * FROM `bigquery-public-data.samples.gsod` LIMIT 1000",
            )

    Args:
        project_id: The name of the associated Google Cloud Project that hosts the dataset to read.
            For more information, see `Creating and Managing Projects <https://cloud.google.com/resource-manager/docs/creating-managing-projects>`_.
        dataset: The name of the dataset hosted in BigQuery, in the format ``dataset_id.table_id``.
            Both the dataset_id and table_id must exist; otherwise, an exception is raised.
        query: The SQL query to execute. ``query`` and ``dataset`` are mutually exclusive.
            If ``query`` is provided, the query result is read as the dataset.
        parallelism: This argument is deprecated. Use the ``override_num_blocks`` argument instead.
        num_cpus: The number of CPUs to reserve for each parallel read worker.
        num_gpus: The number of GPUs to reserve for each parallel read worker. For
            example, specify ``num_gpus=1`` to request 1 GPU for each parallel read