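A context manager that partitions the model parameters during model construction using the MiCS partition strategy. Model states are partitioned across the number of devices specified by the ``mics_shard_size`` field in the DeepSpeed config JSON file. The context manager also introduces a hierarchical communication method that reduces the cost of inter-node communication; it is enabled with the ``mics_hierarchical_params_gather`` field in the DeepSpeed config.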
(self,
module=None,
data_parallel_group=None,
sequence_data_parallel_group=None,
mem_efficient_linear=True,
remote_device=None,
pin_memory=False,
config_dict_or_path=None,
config=None,
enabled=True,
dtype=None,
mpu=None)
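
To make the role of ``mics_shard_size`` concrete, here is a minimal sketch of how ranks might be split into shard groups (each holding one full copy of the model states, partitioned across its members) and replication groups (ranks holding the same partition in different shard groups). The helper ``mics_groups`` and the assumption that shard groups consist of consecutive ranks are illustrative only; they are not taken from the source above and may not match the layout DeepSpeed actually builds.

```python
def mics_groups(world_size, shard_size):
    """Hypothetical helper: split ranks into shard and replication groups.

    Assumes consecutive ranks form one shard group (an illustrative
    assumption, not confirmed by the MiCS_Init source).
    """
    if shard_size <= 0 or world_size % shard_size != 0:
        raise ValueError("shard_size must be positive and divide world_size")

    # Each shard group holds one complete, partitioned copy of the model states.
    shard_groups = [
        list(range(start, start + shard_size))
        for start in range(0, world_size, shard_size)
    ]
    # Ranks at the same position in every shard group hold the same partition.
    replica_groups = [
        list(range(offset, world_size, shard_size))
        for offset in range(shard_size)
    ]
    return shard_groups, replica_groups


shard_groups, replica_groups = mics_groups(world_size=8, shard_size=4)
print(shard_groups)
print(replica_groups)
```

With a smaller shard size, parameter gathering involves fewer ranks, which is the kind of communication cost the summary above refers to; the trade-off is that each rank stores a larger partition.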
class MiCS_Init(Init):

    def __init__(self,
                 module=None,
                 data_parallel_group=None,
                 sequence_data_parallel_group=None,
                 mem_efficient_linear=True,
                 remote_device=None,
                 pin_memory=False,
                 config_dict_or_path=None,
                 config=None,
                 enabled=True,
                 dtype=None,
                 mpu=None):
        """A context manager to partition the model parameters during the model
        construction with MiCS partition strategy. Model states are partitioned
        to the number of devices specified via ``mics_shard_size`` field in the
        deepspeed config json file. The context manager also introduces a
        hierarchical communication method to reduce the cost of inter-node
        communications, which can be enabled with the
        ``mics_hierarchical_params_gather`` field in deepspeed config.

        Args:
            module (``torch.nn.Module``, optional): If provided, partition the model as
                if it was constructed in the context.
            data_parallel_group (``deepspeed.comm`` process group, optional):
                The group of processes to partition among. Defaults to all processes.
                Synonymous with sequence data parallel group for param partitioning
                across both sequence and data parallel groups.
            mem_efficient_linear (bool, optional): Replace
                torch.nn.functional.linear with an implementation that allows
                DeepSpeed to partition parameters. Defaults to ``True``.
            remote_device (string, optional): The initial device to store model
                weights e.g., ``cpu``, ``nvme``. Passing ``"cpu"`` will create the model in CPU
                memory. The model may still be moved to GPU based on the
                offload settings for training. Defaults to param offload device if a config is
                defined, otherwise GPU.
            pin_memory (bool, optional): Potentially increase performance by
                using pinned memory for model weights. ``remote_device`` must be
                ``"cpu"``. Defaults to pin_memory value in config, otherwise ``False``.
            config_dict_or_path (dict or ``json file``, optional): If provided, provides configuration
                for swapping fp16 params to NVMe.
            config (dict or ``json file``, optional): Deprecated, use config_dict_or_path instead.
            enabled (bool, optional): If ``False``, this context has no
                effect. Defaults to ``True``.
            dtype (``dtype``, optional): Can be used to change the data type of the parameters.
                Supported options are ``torch.half`` and ``torch.float``. Defaults to ``None``.
            mpu (``object``, optional): A model parallelism unit object that implements get_{model,data}_parallel_{rank,group,world_size}.

        This context follows the same logic as ``deepspeed.zero.Init()``, but
        with a modification to the partition size of each parameter.

        Examples
        --------

        #. Allocate a model and partition it among all processes:

            .. code-block:: python
                # the config_dict_or_path is required to let the context manager know
                # how to partition the parameters.
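
The configuration that the context manager reads can be sketched as follows. This is a minimal illustration, not the library's own example: the helpers ``build_mics_config`` and ``construct_with_mics`` are hypothetical names, and the sketch assumes that the two MiCS fields named in the docstring live under ``zero_optimization`` with ZeRO stage 3, that a batch-size key is needed for the config to parse, and that the class is exposed as ``deepspeed.zero.MiCS_Init``. Calling ``construct_with_mics`` additionally assumes DeepSpeed is installed and the job was started with a distributed launcher.

```python
import json


def build_mics_config(shard_size, hierarchical_gather=False):
    """Hypothetical helper: return a DeepSpeed-style config dict with MiCS fields."""
    return {
        # Assumption: a batch-size key is included so the config parses.
        "train_micro_batch_size_per_gpu": 1,
        "zero_optimization": {
            # Assumption: MiCS is configured alongside ZeRO stage 3.
            "stage": 3,
            # Number of devices each copy of the model states is partitioned across.
            "mics_shard_size": shard_size,
            # Enables the hierarchical communication method for parameter gathering.
            "mics_hierarchical_params_gather": hierarchical_gather,
        },
    }


def construct_with_mics(model_factory, ds_config):
    """Build a model inside the MiCS context (needs DeepSpeed and a launched job)."""
    import deepspeed  # imported lazily so the config helper works without it

    # config_dict_or_path tells the context manager how to partition parameters.
    with deepspeed.zero.MiCS_Init(config_dict_or_path=ds_config):
        return model_factory()


ds_config = build_mics_config(shard_size=8, hierarchical_gather=True)
print(json.dumps(ds_config, indent=2))
```

Passing a dict keeps the sketch self-contained; since ``config_dict_or_path`` also accepts a JSON file, the same settings could be written to disk with ``json.dump`` and passed as a path instead.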