Join :class:`Datasets <ray.data.Dataset>` on join keys.
(
self,
ds: "Dataset",
join_type: str,
num_partitions: int,
on: Tuple[str] = ("id",),
right_on: Optional[Tuple[str]] = None,
left_suffix: Optional[str] = None,
right_suffix: Optional[str] = None,
*,
partition_size_hint: Optional[int] = None,
aggregator_ray_remote_args: Optional[Dict[str, Any]] = None,
validate_schemas: bool = False,
)
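To make the join types and the role of `num_partitions` concrete, the following is a minimal pure-Python sketch of a hash-partitioned join. It is not Ray's implementation: `toy_join` and its helpers are hypothetical names introduced here for illustration, rows are plain dicts, column suffixes are not modeled (on a name clash the left value wins), and unmatched rows in outer joins simply lack the other side's columns rather than holding nulls.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]


def _key(row: Row, cols: Tuple[str, ...]) -> Tuple[Any, ...]:
    # Build the join key of a row from the given key columns.
    return tuple(row[c] for c in cols)


def _join_partition(lrows: List[Row], rrows: List[Row], join_type: str,
                    on: Tuple[str, ...], right_on: Tuple[str, ...]) -> List[Row]:
    # Index both sides of this partition by join key.
    lindex, rindex = defaultdict(list), defaultdict(list)
    for row in lrows:
        lindex[_key(row, on)].append(row)
    for row in rrows:
        rindex[_key(row, right_on)].append(row)

    # Semi and anti joins only filter one side; they never merge columns.
    if join_type == "left_semi":
        return [r for r in lrows if _key(r, on) in rindex]
    if join_type == "left_anti":
        return [r for r in lrows if _key(r, on) not in rindex]
    if join_type == "right_semi":
        return [r for r in rrows if _key(r, right_on) in lindex]
    if join_type == "right_anti":
        return [r for r in rrows if _key(r, right_on) not in lindex]
    if join_type not in ("inner", "left_outer", "right_outer", "full_outer"):
        raise ValueError(f"unsupported join_type: {join_type!r}")

    out: List[Row] = []
    for lrow in lrows:
        matches = rindex.get(_key(lrow, on), [])
        for rrow in matches:
            out.append({**rrow, **lrow})  # left value wins on a name clash
        if not matches and join_type in ("left_outer", "full_outer"):
            out.append(dict(lrow))
    if join_type in ("right_outer", "full_outer"):
        out.extend(dict(r) for r in rrows if _key(r, right_on) not in lindex)
    return out


def toy_join(left: List[Row], right: List[Row], join_type: str,
             num_partitions: int, on: Tuple[str, ...] = ("id",),
             right_on: Optional[Tuple[str, ...]] = None) -> List[List[Row]]:
    # When right_on is not given, reuse `on` for the right side as well.
    right_on = right_on or on
    # Hash-partition both inputs so that equal keys land in the same partition.
    parts = [([], []) for _ in range(num_partitions)]
    for row in left:
        parts[hash(_key(row, on)) % num_partitions][0].append(row)
    for row in right:
        parts[hash(_key(row, right_on)) % num_partitions][1].append(row)
    # Each partition is joined independently and yields one output block.
    return [_join_partition(l, r, join_type, on, right_on) for l, r in parts]


left = [{"id": i, "double": 2 * i} for i in range(4)]
right = [{"id": i, "square": i * i} for i in range(2, 6)]

blocks = toy_join(left, right, "inner", num_partitions=2)
rows = sorted((r for block in blocks for r in block), key=lambda r: r["id"])
print(len(blocks), rows)
```

Because rows with equal keys always hash to the same partition, each partition can be joined without looking at the others, which is why a larger `num_partitions` lowers the per-partition memory footprint and also fixes the number of output blocks.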
| 3050 | @AllToAllAPI |
| 3051 | @PublicAPI(api_group=SMJ_API_GROUP) |
| 3052 | def join( |
| 3053 | self, |
| 3054 | ds: "Dataset", |
| 3055 | join_type: str, |
| 3056 | num_partitions: int, |
| 3057 | on: Tuple[str] = ("id",), |
| 3058 | right_on: Optional[Tuple[str]] = None, |
| 3059 | left_suffix: Optional[str] = None, |
| 3060 | right_suffix: Optional[str] = None, |
| 3061 | *, |
| 3062 | partition_size_hint: Optional[int] = None, |
| 3063 | aggregator_ray_remote_args: Optional[Dict[str, Any]] = None, |
| 3064 | validate_schemas: bool = False, |
| 3065 | ) -> "Dataset": |
| 3066 | """Join :class:`Datasets <ray.data.Dataset>` on join keys. |
| 3067 | |
| 3068 | Args: |
| 3069 | ds: The other dataset to join against. |
| 3070 | join_type: The kind of join that should be performed, one of ("inner", |
| 3071 | "left_outer", "right_outer", "full_outer", "left_semi", "right_semi", |
| 3072 | "left_anti", "right_anti"). |
| 3073 | num_partitions: Total number of partitions the input sequences are |
| 3074 | split into, with each partition being joined independently. Increasing |
| 3075 | the number of partitions reduces the size of each partition, and hence |
| 3076 | the memory required to join an individual partition. Note that, |
| 3077 | consequently, this is also the total number of blocks produced as a |
| 3078 | result of executing the join. |
| 3079 | on: The columns from the left operand that will be used as |
| 3080 | keys for the join operation. |
| 3081 | right_on: The columns from the right operand that will be |
| 3082 | used as keys for the join operation. When None, the columns in |
| 3083 | `on` are used as the join keys for the right dataset |
| 3084 | as well. |
| 3085 | left_suffix: (Optional) Suffix to be appended to column names of the left |
| 3086 | operand. |
| 3087 | right_suffix: (Optional) Suffix to be appended to column names of the right |
| 3088 | operand. |
| 3089 | partition_size_hint: (Optional) Hint to the joining operator about the |
| 3090 | estimated average size of an individual partition (in bytes). |
| 3091 | This is used to estimate the total dataset size and allows tuning the |
| 3092 | memory requirements of the individual joining workers to prevent OOMs |
| 3093 | when joining very large datasets. |
| 3094 | aggregator_ray_remote_args: (Optional) Parameter overriding `ray.remote` |
| 3095 | args passed when constructing joining (aggregator) workers. |
| 3096 | validate_schemas: (Optional) Controls whether validation of provided |
| 3097 | configuration against input schemas will be performed (defaults to |
| 3098 | False, since obtaining schemas could be prohibitively expensive). |
| 3099 | |
| 3100 | Returns: |
| 3101 | A :class:`Dataset` that holds the rows of the left input Dataset joined with |
| 3102 | the right Dataset, based on the join type and keys. |
| 3103 | |
| 3104 | Examples: |
| 3105 | |
| 3106 | .. testcode:: |
| 3107 | :skipif: True |
| 3108 | |
| 3109 | doubles_ds = ray.data.range(4).map( |