
@dataclass
class ScalingConfig(ScalingConfigV1):
    """Configuration for scaling training.

    Args:
        num_workers: The number of workers (Ray actors) to launch.
            Each worker will reserve 1 CPU by default. The number of CPUs
            reserved by each worker can be overridden with the
            ``resources_per_worker`` argument. If the number of workers is 0,
            the training function will run in local mode, meaning the training
            function runs in the same process. To enable elasticity, provide a
            ``(min_workers, max_workers)`` tuple of ints.
        elastic_resize_monitor_interval_s: While the worker group is healthy,
            consider resizing the worker group every
            ``elastic_resize_monitor_interval_s`` seconds.
        use_gpu: If True, training will be done on GPUs (1 per worker).
            Defaults to False. The number of GPUs reserved by each
            worker can be overridden with the ``resources_per_worker``
            argument.
        resources_per_worker: If specified, the resources
            defined in this dict are reserved for each worker.
            Define the ``"CPU"`` and ``"GPU"`` keys (case-sensitive) to
            override the number of CPUs or GPUs used by each worker.
        placement_strategy: The placement strategy to use for the
            placement group of the Ray actors. See :ref:`Placement Group
            Strategies <pgroup-strategy>` for the possible options.
        label_selector: A label selector, or a list of label selectors, for Ray
            Train worker placement. If a single label selector is provided, it is
            applied to all Ray Train workers. If a list is provided, its length
            must equal the maximum number of Ray Train workers.
        accelerator_type: [Experimental] If specified, Ray Train will launch the
            training coordinator and workers on the nodes with the specified type
            of accelerators.
            See :ref:`the available accelerator types <accelerator_types>`.
            Ensure that your cluster has instances with the specified accelerator
            type or is able to autoscale to fulfill the request. This field is
            required when ``use_tpu`` is True and ``num_workers`` is greater
            than 1.
        use_tpu: [Experimental] If True, training will be done on TPUs (1 TPU VM
            per worker). Defaults to False. The number of TPUs reserved by each
            worker can be overridden with the ``resources_per_worker``
            argument. This argument enables SPMD execution of the training
            workload.
        topology: [Experimental] If specified, Ray Train will launch the training
            coordinator and workers on nodes with the specified topology. Topology
            is auto-detected for TPUs and added as Ray node labels. This argument
            enables SPMD execution of the training workload. This field is
            required when ``use_tpu`` is True and ``num_workers`` is greater
            than 1.
    """

    num_workers: Union[int, Tuple[int, int]] = 1
    trainer_resources: Optional[dict] = None
    label_selector: Optional[Union[Dict[str, str], List[Dict[str, str]]]] = None

    # Accelerator specific fields.
    use_tpu: bool = False
    topology: Optional[str] = None

    # Elasticity specific fields.
    elastic_resize_monitor_interval_s: float = 60.0

    def __post_init__(self):
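The docstring's rules for ``num_workers`` (a fixed count or a ``(min_workers, max_workers)`` tuple) and ``label_selector`` (one selector for all workers, or one per worker up to the maximum) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Ray Train's implementation: ``worker_bounds`` and ``expand_label_selector`` are hypothetical helpers that do not exist in Ray, and the ``region`` label is a made-up example.

```python
from typing import Dict, List, Optional, Tuple, Union

LabelSelector = Dict[str, str]


def worker_bounds(num_workers: Union[int, Tuple[int, int]]) -> Tuple[int, int]:
    """Hypothetical helper: return (min_workers, max_workers).

    A plain int means a fixed-size worker group; a tuple means elastic bounds.
    """
    if isinstance(num_workers, tuple):
        min_workers, max_workers = num_workers
    else:
        min_workers = max_workers = num_workers
    if min_workers < 0 or max_workers < min_workers:
        raise ValueError(f"Invalid num_workers: {num_workers!r}")
    return min_workers, max_workers


def expand_label_selector(
    label_selector: Optional[Union[LabelSelector, List[LabelSelector]]],
    num_workers: Union[int, Tuple[int, int]],
) -> Optional[List[LabelSelector]]:
    """Hypothetical helper: return one label selector per worker slot."""
    if label_selector is None:
        return None
    _, max_workers = worker_bounds(num_workers)
    if isinstance(label_selector, dict):
        # A single selector is applied to every worker.
        return [dict(label_selector) for _ in range(max_workers)]
    if len(label_selector) != max_workers:
        # A list must have one entry per worker, up to the maximum.
        raise ValueError(
            f"Expected {max_workers} label selectors, got {len(label_selector)}"
        )
    return [dict(selector) for selector in label_selector]


# Elastic group of 1 to 3 workers, all sharing one (made-up) label selector.
selectors = expand_label_selector({"region": "us-west"}, (1, 3))
print(selectors)
```

Sizing the expanded list by the maximum worker count mirrors the docstring's requirement that a list of selectors match the maximum number of workers, so an elastic group always has a selector available for each worker it may add.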