hub / github.com/ray-project/ray / FailureConfig

Class FailureConfig

python/ray/train/v2/api/config.py:343–361

Configuration related to failure handling of each training run.

Source from the content-addressed store, hash-verified

@dataclass
class FailureConfig(FailureConfigV1):
    """Configuration related to failure handling of each training run.

    Args:
        max_failures: Tries to recover a run from training worker errors at least this many times.
            Will recover from the latest checkpoint if present.
            Setting to -1 will lead to infinite recovery retries.
            Setting to 0 will disable retries. Defaults to 0.
        controller_failure_limit: [DeveloperAPI] The maximum number of controller failures to tolerate.
            Setting to -1 will lead to infinite controller retries.
            Setting to 0 will disable controller retries. Defaults to -1.
    """

    fail_fast: Union[bool, str] = _DEPRECATED
    controller_failure_limit: int = -1

    def __post_init__(self):
        if self.fail_fast != _DEPRECATED:
            raise DeprecationWarning(FAIL_FAST_DEPRECATION_MESSAGE)
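
The listing depends on names defined elsewhere in Ray (`FailureConfigV1`, `_DEPRECATED`, `FAIL_FAST_DEPRECATION_MESSAGE`), so it cannot be run on its own. The sketch below is a self-contained stand-in, not Ray's implementation: the sentinel value, the message text, and the one-field base class are assumptions, and `may_retry` is a hypothetical helper that only illustrates how the docstring describes the `-1` and `0` limits.

```python
from dataclasses import dataclass
from typing import Union

# Stand-ins for names defined elsewhere in Ray; these values are assumptions.
_DEPRECATED = "DEPRECATED"
FAIL_FAST_DEPRECATION_MESSAGE = "`fail_fast` is deprecated (placeholder message)."


@dataclass
class FailureConfigV1:
    # Hypothetical minimal base class: only the field the docstring documents.
    max_failures: int = 0


@dataclass
class FailureConfig(FailureConfigV1):
    fail_fast: Union[bool, str] = _DEPRECATED
    controller_failure_limit: int = -1

    def __post_init__(self):
        # Any explicit value for fail_fast is rejected at construction time.
        if self.fail_fast != _DEPRECATED:
            raise DeprecationWarning(FAIL_FAST_DEPRECATION_MESSAGE)


def may_retry(failure_count: int, limit: int) -> bool:
    """Hypothetical helper: may a retry follow the `failure_count`-th failure?"""
    if limit == -1:  # -1 means retry indefinitely
        return True
    return failure_count <= limit  # a limit of 0 never allows a retry


# Usage: tolerate up to three worker failures, keep the default controller limit.
cfg = FailureConfig(max_failures=3)
retry_after_third = may_retry(3, cfg.max_failures)
retry_after_fourth = may_retry(4, cfg.max_failures)

# Passing the deprecated field raises instead of merely warning.
try:
    FailureConfig(fail_fast=True)
except DeprecationWarning as exc:
    deprecation_error = str(exc)
```

Note that `__post_init__` raises `DeprecationWarning` as an exception rather than emitting it through `warnings.warn`, so a call site that still passes `fail_fast` fails immediately. In Ray itself, a `FailureConfig` is usually handed to the trainer through `RunConfig(failure_config=...)`; check that against the installed Ray version before relying on it.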

Calls

no outgoing calls
