Creates a distributed session. It calls `MonitoredTrainingSession` to create a :class:`MonitoredSession` for distributed training.
create_distributed_session(
    task_spec=None, checkpoint_dir=None, scaffold=None, hooks=None, chief_only_hooks=None, save_checkpoint_secs=600,
    save_summaries_steps=object(), save_summaries_secs=object(), config=None, stop_grace_period_secs=120,
    log_step_count_steps=100
)
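The `object()` defaults shown for `save_summaries_steps` and `save_summaries_secs` suggest a sentinel pattern: a unique placeholder lets the function tell "argument not passed" apart from an explicit `None`. The sketch below illustrates that idea under the rule stated in the parameter documentation (summaries every 100 steps by default, time-based saving disabled by default). The names `USE_DEFAULT`, `resolve_summary_settings`, and `summary_saver_enabled` are hypothetical and chosen for illustration; this is not the library's own implementation.

```python
# Hypothetical sentinel standing in for the `object()` defaults in the signature.
USE_DEFAULT = object()


def resolve_summary_settings(save_summaries_steps=USE_DEFAULT,
                             save_summaries_secs=USE_DEFAULT):
    """Return (steps, secs) with sentinels replaced by concrete values.

    Assumed rule: if neither argument is given, save every 100 steps;
    if only one is given, the other is disabled (None).
    """
    steps_unset = save_summaries_steps is USE_DEFAULT
    secs_unset = save_summaries_secs is USE_DEFAULT
    if steps_unset and secs_unset:
        return 100, None
    if steps_unset:
        return None, save_summaries_secs
    if secs_unset:
        return save_summaries_steps, None
    return save_summaries_steps, save_summaries_secs


def summary_saver_enabled(steps, secs):
    """The default summary saver is skipped only when both values are None."""
    return steps is not None or secs is not None
```

Under these assumptions, passing `None` for both arguments is the only way to turn the default summary saver off, whereas omitting both falls back to step-based saving.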

@deprecated(date="2018-10-30", instructions="Using the TensorLayer distributed trainer.")
def create_distributed_session(
    task_spec=None, checkpoint_dir=None, scaffold=None, hooks=None, chief_only_hooks=None, save_checkpoint_secs=600,
    save_summaries_steps=object(), save_summaries_secs=object(), config=None, stop_grace_period_secs=120,
    log_step_count_steps=100
):
    """Creates a distributed session.

    It calls `MonitoredTrainingSession` to create a :class:`MonitoredSession` for distributed training.

    Parameters
    ----------
    task_spec : :class:`TaskSpecDef`
        The task spec definition returned by ``create_task_spec_def()``.
    checkpoint_dir : str
        Optional path to a directory from which to restore variables.
    scaffold : ``Scaffold``
        A `Scaffold` used for gathering or building supportive ops.
        If not specified, a default one is created. It's used to finalize the graph.
    hooks : list of ``SessionRunHook`` objects
        Optional.
    chief_only_hooks : list of ``SessionRunHook`` objects
        These hooks are activated only if `is_chief==True`; otherwise they are ignored.
    save_checkpoint_secs : int
        The frequency, in seconds, at which a checkpoint is saved
        using a default checkpoint saver. If `save_checkpoint_secs` is set to
        `None`, then the default checkpoint saver isn't used.
    save_summaries_steps : int
        The frequency, in number of global steps, at which the
        summaries are written to disk using a default summary saver. If both
        `save_summaries_steps` and `save_summaries_secs` are set to `None`, then
        the default summary saver isn't used. Default 100.
    save_summaries_secs : int
        The frequency, in seconds, at which the summaries are written
        to disk using a default summary saver. If both `save_summaries_steps` and
        `save_summaries_secs` are set to `None`, then the default summary saver
        isn't used. Default not enabled.
    config : ``tf.ConfigProto``
        An instance of `tf.ConfigProto` used to configure the session.
        It is the `config` argument of the `tf.Session` constructor.
    stop_grace_period_secs : int
        Number of seconds given to threads to stop after
        `close()` has been called.
    log_step_count_steps : int
        The frequency, in number of global steps, at which the
        global step/sec is logged.

    Examples
    --------
    A simple example for distributed training where all the workers use the same dataset:

    >>> task_spec = create_task_spec_def()
    >>> with tf.device(task_spec.device_fn()):
    ...     tensors = create_graph()
    >>> with create_distributed_session(task_spec=task_spec,
    ...                                 checkpoint_dir='/tmp/ckpt') as session:
    ...     while not session.should_stop():
    ...         session.run(tensors)
