@DeveloperAPI
@dataclass
class DataContext:
    """Global settings for Ray Data.

    Configure this class to enable advanced features and tune performance.

    .. warning::
        Apply changes before creating a :class:`~ray.data.Dataset`. Changes made after
        won't take effect.

    .. note::
        This object is automatically propagated to workers. Access it from the driver
        and remote workers with :meth:`DataContext.get_current()`.

    Examples:
        >>> from ray.data import DataContext
        >>> DataContext.get_current().enable_progress_bars = False

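        A further sketch of tuning read behavior, assuming the attributes
        documented under ``Args`` are set on the current context before any
        :class:`~ray.data.Dataset` is created. The values are illustrative, not
        recommendations:

        >>> ctx = DataContext.get_current()
        >>> ctx.target_max_block_size = 64 * 1024 * 1024  # 64 MiB, illustrative
        >>> ctx.read_op_min_num_blocks = 100  # illustrative
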
    Args:
        target_max_block_size: The max target block size in bytes for reads and
            transformations. If ``None``, the block size is unbounded.
        target_min_block_size: Ray Data avoids creating blocks smaller than this
            size in bytes on read. This takes precedence over
            ``read_op_min_num_blocks``.
        streaming_read_buffer_size: Buffer size when doing streaming reads from
            local or remote storage.
        enable_pandas_block: Whether pandas block format is enabled.
        actor_prefetcher_enabled: Whether to use the actor-based block prefetcher.
        autoscaling_config: Autoscaling configuration.
        use_push_based_shuffle: Whether to use push-based shuffle.
        pipeline_push_based_shuffle_reduce_tasks:
        scheduling_strategy: The global scheduling strategy. For tasks with large
            args, ``scheduling_strategy_large_args`` takes precedence.
        scheduling_strategy_large_args: Scheduling strategy for tasks with large
            args.
        large_args_threshold: Size in bytes above which task arguments are
            considered large. Choose a value so that the data transfer overhead is
            significant in comparison to task scheduling (i.e., low tens of ms).
        use_polars: Whether to use Polars for tabular dataset sorts, groupbys, and
            aggregations.
        eager_free: Whether to eagerly free memory.
        decoding_size_estimation: Whether to estimate the in-memory size of decoded
            data for a data source.
        min_parallelism: This setting is deprecated. Use ``read_op_min_num_blocks``
            instead.
        read_op_min_num_blocks: Minimum number of read output blocks for a dataset.
        use_datasource_v2: When ``True``, ``ray.data.read_parquet()`` routes through
            the DataSourceV2 pipeline (``ListFiles → ReadFiles`` logical chain,
            driver-side first-file sampling for schema inference,
            ``ParquetScanner`` / ``ParquetFileReader``). Defaults to ``False``; V1
            remains the production path while V2 stabilizes.
        parquet_chunker_target_chunk_size: Target chunk size in bytes used by
            ``ParquetFileChunker`` when splitting large Parquet files into
            multiple read tasks. When ``None``, the chunker's built-in default
            (currently 1 GiB) is used.
        enable_tensor_extension_casting: Whether to automatically cast NumPy ndarray
            columns in pandas DataFrames to tensor extension columns.
        arrow_fixed_shape_tensor_format: The tensor format to use for fixed-shape
            tensors. Options are ``FixedShapeTensorFormat.V1``,
            ``FixedShapeTensorFormat.V2``, and
            ``FixedShapeTensorFormat.ARROW_NATIVE``. Default is ``V2``. NOTE: For
            ``ARROW_NATIVE``, only numeric types (integers, floats) are currently
            supported.