class CheckpointManager:
    """
    Controls saving and loading of workspaces on every epoch boundary of a job.
    If a CheckpointManager instance is passed to JobRunner, then JobRunner will
    call `init`, `read` and `save` at different moments in between epoch runs.

    Args:
        db_prefix: The prefix used to construct the full db name. Since
            `absolute_path` is set to True, this will be used as db_name in
            SaveOp.
        node_name: Name of the node where this checkpoint_manager is used.
        db_type: Type of database to use for storing the checkpoint.
        metadata_handler: An optional object capable of reading/writing
            checkpoint info in the storage of choice.
    """

    BLOB_NAMES = "blob_names"

    def __init__(self, db_prefix, node_name, db_type, metadata_handler=None):
        self._db_prefix = db_prefix
        self._node_name = node_name
        self._db_type = db_type
        self._metadata_handler = metadata_handler
        # Make sure these blobs are the first in the checkpoint file.
        self._net = core.Net('!!checkpoint_mngr')
        self._blob_names = self._net.AddExternalInput(self.BLOB_NAMES)
        self._names_output = None
        self._path_prefix = None
        self._path_type = None
        self._current_db_name = None
        self._current_checkpoint_duration = None

    def init(
        self,
        nodes=None,
        retrieve_from_epoch=None,
        path_prefix=None,
        path_type=None
    ):
        """
        Initialize the checkpoint manager. Determines all blobs that need to be
        saved, or loads them from a checkpoint.

        Builds a Task that will be run once after the job's `init_group` is run.
        This task will determine which blobs need to be checkpointed.
        If retrieve_from_epoch is not None, then the checkpoint metadata is
        retrieved from a previously saved checkpoint.

        Args:
            nodes: An array of nodes where this checkpoint manager is running.
                Should only contain a single node.
            retrieve_from_epoch: Set to a number to load blobs from this epoch.
            path_prefix: Used to construct the db name or path where checkpoint
                files are stored.
            path_type: Indicates the type of path where checkpoint files are
                stored.
        """
        assert nodes is None or len(nodes) == 1, (
            'CheckpointManager only supports single node.')
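
The class docstring says the runner drives the manager at epoch boundaries. The sketch below illustrates that lifecycle with a toy, in-memory stand-in. It is not the Caffe2 implementation: `ToyCheckpointManager`, `run_job`, the dict-based workspace, and the exact call order are all assumptions made for illustration, and restoring is folded into `init` here for brevity.

```python
class ToyCheckpointManager:
    """Hypothetical in-memory stand-in; not the Caffe2 CheckpointManager."""

    def __init__(self):
        self.calls = []       # records the order of lifecycle calls
        self.saved = {}       # epoch -> snapshot of the checkpointed blobs
        self.blob_names = []  # names of the blobs chosen for checkpointing

    def init(self, workspace, retrieve_from_epoch=None):
        self.calls.append("init")
        # Decide which blobs to checkpoint (assumption: every workspace key).
        self.blob_names = sorted(workspace)
        if retrieve_from_epoch is not None:
            # Restore the blobs saved at the requested epoch.
            workspace.update(self.saved[retrieve_from_epoch])

    def save(self, workspace, epoch):
        self.calls.append(f"save:{epoch}")
        self.saved[epoch] = {name: workspace[name] for name in self.blob_names}


def run_job(manager, workspace, num_epochs, resume_from_epoch=None):
    """Hypothetical runner loop: init once, then save after every epoch."""
    manager.init(workspace, retrieve_from_epoch=resume_from_epoch)
    start = 0 if resume_from_epoch is None else resume_from_epoch
    for epoch in range(start + 1, num_epochs + 1):
        workspace["step"] += 1  # stand-in for running one epoch of work
        manager.save(workspace, epoch)
    return workspace
```

Calling `run_job(manager, {"step": 0}, 3)` records one `init` followed by a save per epoch; passing `resume_from_epoch=2` afterwards restores the epoch-2 snapshot and runs only the remaining epoch.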