hub / github.com/ray-project/ray / from_spark

Function from_spark

python/ray/data/read_api.py:3878–3901

Create a :class:`~ray.data.Dataset` from a `Spark DataFrame`_. The DataFrame must be created by RayDP (Spark-on-Ray). The ``parallelism`` argument is deprecated; use ``override_num_blocks`` instead.

(
    df: "pyspark.sql.DataFrame",
    *,
    parallelism: Optional[int] = None,
    override_num_blocks: Optional[int] = None,
)

Source from the content-addressed store, hash-verified

@PublicAPI
def from_spark(
    df: "pyspark.sql.DataFrame",
    *,
    parallelism: Optional[int] = None,
    override_num_blocks: Optional[int] = None,
) -> MaterializedDataset:
    """Create a :class:`~ray.data.Dataset` from a
    `Spark DataFrame <https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html>`_.

    Args:
        df: A `Spark DataFrame`_, which must be created by RayDP (Spark-on-Ray).
        parallelism: This argument is deprecated. Use ``override_num_blocks`` argument.
        override_num_blocks: Override the number of output blocks from all read tasks.
            By default, the number of output blocks is dynamically decided based on
            input data size and available resources. You shouldn't manually set this
            value in most cases.

    Returns:
        A :class:`~ray.data.MaterializedDataset` holding rows read from the DataFrame.
    """  # noqa: E501
    import raydp

    parallelism = _get_num_output_blocks(parallelism, override_num_blocks)
    return raydp.spark.spark_dataframe_to_ray_dataset(df, parallelism)
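Usage sketch (not part of the Ray source). The wrapper below is a hypothetical helper, `to_ray_dataset`, that forwards to `ray.data.from_spark` and passes the block count only through `override_num_blocks`, since the docstring marks `parallelism` as deprecated. The `from_spark` keyword on the wrapper is an assumption added here so the call can be exercised without a Spark cluster. The `demo` function assumes RayDP is installed and that `raydp.init_spark` / `raydp.stop_spark` behave as described in the RayDP README; the executor sizes are placeholder values.

```python
from typing import Any, Callable, Optional


def to_ray_dataset(
    df: Any,
    num_blocks: Optional[int] = None,
    *,
    from_spark: Optional[Callable[..., Any]] = None,
) -> Any:
    """Hypothetical wrapper: convert a RayDP-created Spark DataFrame to a Ray Dataset.

    `num_blocks` is forwarded as `override_num_blocks`; the deprecated
    `parallelism` argument is never used.
    """
    if num_blocks is not None and num_blocks < 1:
        raise ValueError("num_blocks must be a positive integer")

    if from_spark is None:
        # Imported lazily so the helper can be defined without Ray installed.
        import ray.data

        from_spark = ray.data.from_spark

    if num_blocks is None:
        # Let Ray Data choose the number of output blocks.
        return from_spark(df)
    return from_spark(df, override_num_blocks=num_blocks)


def demo() -> None:
    """Assumes `ray` and `raydp` are installed; resource sizes are placeholders."""
    import ray
    import raydp

    ray.init(ignore_reinit_error=True)
    spark = raydp.init_spark(
        app_name="from_spark_demo",
        num_executors=1,
        executor_cores=1,
        executor_memory="1GB",
    )
    try:
        df = spark.range(0, 1000)  # DataFrame created by a RayDP Spark session
        ds = to_ray_dataset(df, num_blocks=4)
        print(ds.count())
    finally:
        raydp.stop_spark()
```

Leaving `num_blocks` unset matches the docstring's advice that the block count should rarely be set by hand.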

Callers

nothing calls this directly

Calls 1

_get_num_output_blocks · Function · 0.85

Tested by

no test coverage detected
