hub / github.com/ray-project/ray / ScalingConfig

Class ScalingConfig

python/ray/train/v2/api/config.py:31–279

Configuration for scaling training.


@dataclass
class ScalingConfig(ScalingConfigV1):
    """Configuration for scaling training.

    Args:
        num_workers: The number of workers (Ray actors) to launch.
            Each worker will reserve 1 CPU by default. The number of CPUs
            reserved by each worker can be overridden with the
            ``resources_per_worker`` argument. If the number of workers is 0,
            the training function will run in local mode, meaning the training
            function runs in the same process. To enable elasticity, provide a
            ``(min_workers, max_workers)`` tuple of ints.
        elastic_resize_monitor_interval_s: While the worker group is healthy,
            consider resizing the worker group every
            ``elastic_resize_monitor_interval_s`` seconds.
        use_gpu: If True, training will be done on GPUs (1 per worker).
            Defaults to False. The number of GPUs reserved by each
            worker can be overridden with the ``resources_per_worker``
            argument.
        resources_per_worker: If specified, the resources
            defined in this Dict is reserved for each worker.
            Define the ``"CPU"`` and ``"GPU"`` keys (case-sensitive) to
            override the number of CPU or GPUs used by each worker.
        placement_strategy: The placement strategy to use for the
            placement group of the Ray actors. See :ref:`Placement Group
            Strategies <pgroup-strategy>` for the possible options.
        label_selector: A list of label selectors for Ray Train worker placement.
            If a single label selector is provided, it will be applied to all Ray Train workers.
            If a list is provided, it must be the same length as the max number of Ray Train workers.
        accelerator_type: [Experimental] If specified, Ray Train will launch the
            training coordinator and workers on the nodes with the specified type
            of accelerators.
            See :ref:`the available accelerator types <accelerator_types>`.
            Ensure that your cluster has instances with the specified accelerator type
            or is able to autoscale to fulfill the request. This field is required
            when `use_tpu` is True and `num_workers` is greater than 1.
        use_tpu: [Experimental] If True, training will be done on TPUs (1 TPU VM
            per worker). Defaults to False. The number of TPUs reserved by each
            worker can be overridden with the ``resources_per_worker``
            argument. This arg enables SPMD execution of the training workload.
        topology: [Experimental] If specified, Ray Train will launch the training
            coordinator and workers on nodes with the specified topology. Topology is
            auto-detected for TPUs and added as Ray node labels. This arg enables
            SPMD execution of the training workload. This field is required
            when `use_tpu` is True and `num_workers` is greater than 1.
    """

    num_workers: Union[int, Tuple[int, int]] = 1
    trainer_resources: Optional[dict] = None
    label_selector: Optional[Union[Dict[str, str], List[Dict[str, str]]]] = None

    # Accelerator specific fields.
    use_tpu: Union[bool] = False
    topology: Optional[str] = None

    # Elasticity specific fields.
    elastic_resize_monitor_interval_s: float = 60.0

    def __post_init__(self):
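
The body of `__post_init__` is not shown in this excerpt. As a rough illustration of the constraints the docstring states (an int or a `(min_workers, max_workers)` tuple for `num_workers`, a `label_selector` list as long as the maximum worker count, and `topology` plus `accelerator_type` required for multi-worker TPU runs), here is a minimal sketch. `ScalingSketch`, `worker_range`, and `elastic` are hypothetical names and this is not Ray's implementation. Two details are assumptions: that "`num_workers` is greater than 1" is checked against the maximum of an elastic range, and that a tuple with `max < min` is rejected.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union

LabelSelector = Dict[str, str]


@dataclass
class ScalingSketch:
    """Hypothetical stand-in for illustration only; NOT Ray's ScalingConfig."""

    num_workers: Union[int, Tuple[int, int]] = 1
    label_selector: Optional[Union[LabelSelector, List[LabelSelector]]] = None
    use_tpu: bool = False
    topology: Optional[str] = None
    accelerator_type: Optional[str] = None

    @property
    def elastic(self) -> bool:
        """True when num_workers was given as a (min_workers, max_workers) tuple."""
        return isinstance(self.num_workers, tuple)

    @property
    def worker_range(self) -> Tuple[int, int]:
        """Return (min_workers, max_workers); a plain int means a fixed size."""
        if isinstance(self.num_workers, tuple):
            return self.num_workers
        return (self.num_workers, self.num_workers)

    def __post_init__(self) -> None:
        low, high = self.worker_range
        # Assumption: an elastic range must be ordered (min <= max).
        if high < low:
            raise ValueError(f"max_workers ({high}) is below min_workers ({low}).")
        # Documented: a list of selectors needs one entry per worker at max size.
        if isinstance(self.label_selector, list) and len(self.label_selector) != high:
            raise ValueError(
                f"label_selector has {len(self.label_selector)} entries; expected {high}."
            )
        # Documented: multi-worker TPU runs require topology and accelerator_type.
        # Assumption: "greater than 1" is evaluated against the maximum size.
        if self.use_tpu and high > 1:
            if self.topology is None or self.accelerator_type is None:
                raise ValueError(
                    "topology and accelerator_type are required when "
                    "use_tpu is True and num_workers is greater than 1."
                )
```

With this sketch, `ScalingSketch(num_workers=(2, 4))` describes an elastic group of two to four workers, while `ScalingSketch(num_workers=2, use_tpu=True)` raises `ValueError` because neither `topology` nor `accelerator_type` is set. The label key `"zone"` used when trying it out is likewise only a placeholder.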

Callers 15

test_torch_trainer_crash · Function · 0.90
test_training · Method · 0.90
test_checkpoint_freq_dir_name · Function · 0.90
__init__ · Method · 0.90
test_keras_callback_e2e · Function · 0.90
__init__ · Method · 0.90
__repr__ · Method · 0.90
setup · Method · 0.90
default_resource_request · Method · 0.90
test_report_mixed_checkpoint_upload_modes · Function · 0.90

Calls

no outgoing calls

Tested by 15

test_torch_trainer_crash · Function · 0.72
test_training · Method · 0.72
test_checkpoint_freq_dir_name · Function · 0.72
__init__ · Method · 0.72
test_keras_callback_e2e · Function · 0.72
test_report_mixed_checkpoint_upload_modes · Function · 0.72
test_report_delete_local_checkpoint_after_upload · Function · 0.72
test_report_checkpoint_upload_error · Function · 0.72
test_report_validation_without_validation_fn · Function · 0.72
test_report_validation_without_checkpoint · Function · 0.72
