Method join

python/ray/data/dataset.py:3052–3223 · view source on GitHub ↗

Join :class:`Datasets ` on join keys. Args: ds: Other dataset to join against join_type: The kind of join that should be performed, one of ("inner", "left_outer", "right_outer", "full_outer", "left_semi", "right_semi",

(
        self,
        ds: "Dataset",
        join_type: str,
        num_partitions: int,
        on: Tuple[str] = ("id",),
        right_on: Optional[Tuple[str]] = None,
        left_suffix: Optional[str] = None,
        right_suffix: Optional[str] = None,
        *,
        partition_size_hint: Optional[int] = None,
        aggregator_ray_remote_args: Optional[Dict[str, Any]] = None,
        validate_schemas: bool = False,
    )

Source from the content-addressed store, hash-verified

3050	@AllToAllAPI
3051	@PublicAPI(api_group=SMJ_API_GROUP)
3052	def join(
3053	self,
3054	ds: "Dataset",
3055	join_type: str,
3056	num_partitions: int,
3057	on: Tuple[str] = ("id",),
3058	right_on: Optional[Tuple[str]] = None,
3059	left_suffix: Optional[str] = None,
3060	right_suffix: Optional[str] = None,
3061	*,
3062	partition_size_hint: Optional[int] = None,
3063	aggregator_ray_remote_args: Optional[Dict[str, Any]] = None,
3064	validate_schemas: bool = False,
3065	) -> "Dataset":
3066	"""Join :class:`Datasets <ray.data.Dataset>` on join keys.
3067
3068	Args:
3069	ds: Other dataset to join against
3070	join_type: The kind of join that should be performed, one of ("inner",
3071	"left_outer", "right_outer", "full_outer", "left_semi", "right_semi",
3072	"left_anti", "right_anti").
3073	num_partitions: Total number of "partitions" input sequences will be split
3074	into with each partition being joined independently. Increasing number
3075	of partitions allows to reduce individual partition size, hence reducing
3076	memory requirements when individual partitions are being joined. Note
3077	that, consequently, this will also be a total number of blocks that will
3078	be produced as a result of executing join.
3079	on: The columns from the left operand that will be used as
3080	keys for the join operation.
3081	right_on: The columns from the right operand that will be
3082	used as keys for the join operation. When none, `on` will
3083	be assumed to be a list of columns to be used for the right dataset
3084	as well.
3085	left_suffix: (Optional) Suffix to be appended for columns of the left
3086	operand.
3087	right_suffix: (Optional) Suffix to be appended for columns of the right
3088	operand.
3089	partition_size_hint: (Optional) Hint to joining operator about the estimated
3090	avg expected size of the individual partition (in bytes).
3091	This is used in estimating the total dataset size and allow to tune
3092	memory requirement of the individual joining workers to prevent OOMs
3093	when joining very large datasets.
3094	aggregator_ray_remote_args: (Optional) Parameter overriding `ray.remote`
3095	args passed when constructing joining (aggregator) workers.
3096	validate_schemas: (Optional) Controls whether validation of provided
3097	configuration against input schemas will be performed (defaults to
3098	false, since obtaining schemas could be prohibitively expensive).
3099
3100	Returns:
3101	A :class:`Dataset` that holds rows of input left Dataset joined with the
3102	right Dataset based on join type and keys.
3103
3104	Examples:
3105
3106	.. testcode::
3107	:skipif: True
3108
3109	doubles_ds = ray.data.range(4).map(

Callers 15

setup.pyFile · 0.45

find_versionFunction · 0.45

_find_bazel_binFunction · 0.45

fixed_isdirFunction · 0.45

replace_symlinks_with_junctionsFunction · 0.45

buildFunction · 0.45

_walk_thirdparty_dirFunction · 0.45

copy_fileFunction · 0.45

pip_runFunction · 0.45

_init_argsMethod · 0.45

_client_deprecation_warnMethod · 0.45

to_bytesMethod · 0.45

Calls 5

schemaMethod · 0.95

JoinClass · 0.90

LogicalPlanClass · 0.90

_validate_schemasMethod · 0.80

_from_parentMethod · 0.80

Tested by 15

test_create_new_directoryMethod · 0.36

test_existing_directoryMethod · 0.36

test_nested_directory_creationMethod · 0.36

test_tilde_expansionMethod · 0.36

start_usage_stats_serverFunction · 0.36

download_model_from_s3Function · 0.36

get_expected_contentMethod · 0.36

validate_transcription_streaming_chunksMethod · 0.36

test_vllm_config_ray_serve_vs_cli_comparisonFunction · 0.36

mock_serverFunction · 0.36

test_chatMethod · 0.36

test_completionMethod · 0.36