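A minimal usage sketch for `from_spark`, under stated assumptions: Ray, RayDP, and PySpark are installed, and a local Ray cluster can be started. The helper names `from_spark_kwargs` and `spark_range_to_ray`, the application name, and the executor settings are hypothetical and chosen only for illustration. The sketch passes `override_num_blocks` only when a block count is requested and never passes the deprecated `parallelism` argument.

```python
from typing import Any, Dict, Optional


def from_spark_kwargs(num_blocks: Optional[int] = None) -> Dict[str, Any]:
    """Hypothetical helper: build keyword arguments for ray.data.from_spark.

    Only ``override_num_blocks`` is ever set; the deprecated ``parallelism``
    argument is never passed.
    """
    if num_blocks is None:
        # No explicit request: let Ray Data choose the number of blocks.
        return {}
    if num_blocks < 1:
        raise ValueError("num_blocks must be a positive integer")
    return {"override_num_blocks": num_blocks}


def spark_range_to_ray(num_rows: int, num_blocks: Optional[int] = None) -> int:
    """Convert a small RayDP-created DataFrame and return the dataset row count."""
    # Imported lazily so the module loads even where Ray/RayDP are absent.
    import ray
    import raydp

    ray.init(ignore_reinit_error=True)
    # The DataFrame must come from a RayDP (Spark-on-Ray) session.
    spark = raydp.init_spark(
        app_name="from_spark_example",  # illustrative settings, not recommendations
        num_executors=1,
        executor_cores=1,
        executor_memory="1GB",
    )
    try:
        df = spark.range(num_rows)  # single-column DataFrame named "id"
        ds = ray.data.from_spark(df, **from_spark_kwargs(num_blocks))
        # Count while the Spark session is still alive.
        return ds.count()
    finally:
        raydp.stop_spark()
```

Calling `spark_range_to_ray(100)` leaves the block count to Ray Data, which the docstring recommends for most cases; `spark_range_to_ray(100, num_blocks=4)` requests four output blocks.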
@PublicAPI
def from_spark(
    df: "pyspark.sql.DataFrame",
    *,
    parallelism: Optional[int] = None,
    override_num_blocks: Optional[int] = None,
) -> MaterializedDataset:
    """Create a :class:`~ray.data.Dataset` from a
    `Spark DataFrame <https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html>`_.

    Args:
        df: A `Spark DataFrame`_, which must be created by RayDP (Spark-on-Ray).
        parallelism: This argument is deprecated. Use ``override_num_blocks`` argument.
        override_num_blocks: Override the number of output blocks from all read tasks.
            By default, the number of output blocks is dynamically decided based on
            input data size and available resources. You shouldn't manually set this
            value in most cases.

    Returns:
        A :class:`~ray.data.MaterializedDataset` holding rows read from the DataFrame.
    """  # noqa: E501
    import raydp

    parallelism = _get_num_output_blocks(parallelism, override_num_blocks)
    return raydp.spark.spark_dataframe_to_ray_dataset(df, parallelism)


@PublicAPI