@dataclass
class FailureConfig(FailureConfigV1):
    """Configuration related to failure handling of each training run.

    Args:
        max_failures: Tries to recover a run from training worker errors at least this many times.
            Will recover from the latest checkpoint if present.
            Setting to -1 will lead to infinite recovery retries.
            Setting to 0 will disable retries. Defaults to 0.
        controller_failure_limit: [DeveloperAPI] The maximum number of controller failures to tolerate.
            Setting to -1 will lead to infinite controller retries.
            Setting to 0 will disable controller retries. Defaults to -1.
    """

    fail_fast: Union[bool, str] = _DEPRECATED
    controller_failure_limit: int = -1

    def __post_init__(self):
        if self.fail_fast != _DEPRECATED:
            raise DeprecationWarning(FAIL_FAST_DEPRECATION_MESSAGE)


@PublicAPI(stability="alpha")