
Method _shared_step

python/dgl/optim/pytorch/sparse_optim.py:202–433

(self)


200	self.update(idx, grad, emb)
201
202	    def _shared_step(self):
203	        with th.no_grad():
204	            # Frequently allocating and freeing shared memory to hold intermediate
204	            # tensors is expensive.
205	            # We cache shared memory buffers in shared_emb.
206	            shared_emb = {emb.name: ([], []) for emb in self._params}
207
208	            # Go through all sparse embeddings
209	            for emb in self._params:  # pylint: disable=too-many-nested-blocks
210	                emb_name = emb.name
211
212	                # We need to combine gradients from multiple forward paths
213	                idx = []
214	                grad = []
215	                for i, data in emb._trace:
216	                    idx.append(i)
217	                    grad.append(data.grad.data)
218	                # If the sparse embedding was not used in the previous forward step,
219	                # idx and grad will be empty. Initialize them as empty tensors to
220	                # avoid crashing the optimizer step logic.
221	                #
222	                # Note: we cannot skip the gradient exchange and update steps, as other
223	                # worker processes may send gradient update requests for this
224	                # embedding to this process.
225	                idx = (
226	                    th.cat(idx, dim=0)
227	                    if len(idx) != 0
228	                    else th.zeros((0,), dtype=th.long, device=th.device("cpu"))
229	                )
230	                grad = (
231	                    th.cat(grad, dim=0)
232	                    if len(grad) != 0
233	                    else th.zeros(
234	                        (0, emb.embedding_dim),
235	                        dtype=th.float32,
236	                        device=th.device("cpu"),
237	                    )
238	                )
239
240	                device = grad.device
241	                idx_dtype = idx.dtype
242	                grad_dtype = grad.dtype
243	                grad_dim = grad.shape[1]
244	                if self._world_size > 1:
245	                    if emb_name not in self._shared_cache:
246	                        self._shared_cache[emb_name] = {}
247
248	                    # Each training process takes responsibility for updating a range
249	                    # of node embeddings, so the gradient update can be parallelized.
250	                    # The overall process includes:
251	                    #   1. In each training process:
252	                    #     1.a Decide which process a node embedding belongs to according
253	                    #         to the formula: process_id = node_idx mod num_of_process(N)
254	                    #     1.b Split the node index tensor and gradient tensor into N parts
255	                    #         according to step 1.a.
256	                    #     1.c Write each node index sub-tensor and gradient sub-tensor into
257	                    #         different DGL shared memory buffers.
258	                    #   2. Cross training process synchronization
259	                    #   3. In each training process:
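
The partitioning described in steps 1.a and 1.b of the comment above can be sketched as follows. This is a minimal illustration, not DGL's implementation: plain Python lists stand in for the torch index and gradient tensors so the sketch runs without dependencies, and the helper name partition_by_owner and its parameter num_procs are hypothetical. The shared-memory writes (step 1.c) and the cross-process barrier (step 2) are left out.

```python
from typing import List, Sequence, Tuple

Row = List[float]  # one gradient row (stand-in for a row of the grad tensor)


def partition_by_owner(
    idx: Sequence[int], grad: Sequence[Row], num_procs: int
) -> List[Tuple[List[int], List[Row]]]:
    """Split (idx, grad) into num_procs parts, one per owning process.

    Ownership follows the formula quoted in the source comment:
    process_id = node_idx mod num_of_process(N).
    """
    if len(idx) != len(grad):
        raise ValueError("idx and grad must have the same number of rows")
    # One (indices, gradient rows) pair per process; a pair stays empty if
    # this process holds no gradients for the nodes that process owns.
    parts: List[Tuple[List[int], List[Row]]] = [([], []) for _ in range(num_procs)]
    for node_idx, row in zip(idx, grad):
        owner = node_idx % num_procs
        parts[owner][0].append(node_idx)
        parts[owner][1].append(row)
    return parts


# Hypothetical data: five node indices, embedding dimension 2, three processes.
idx = [0, 5, 2, 7, 4]
grad = [[0.1, 0.1], [0.5, 0.5], [0.2, 0.2], [0.7, 0.7], [0.4, 0.4]]
parts = partition_by_owner(idx, grad, num_procs=3)

# An embedding that was unused in the last forward pass still yields one
# (empty) part per process, so every process can take part in the exchange.
empty_parts = partition_by_owner([], [], num_procs=3)
```

Each process would then hand part k to process k. The empty case mirrors the note in the listing: the exchange cannot be skipped just because the local idx and grad are empty.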

Callers (1)

step · Method · 0.95

Calls (10)

update · Method · 0.95

create_shared_mem_array · Function · 0.85

get_shared_mem_array · Function · 0.85

append · Method · 0.80

format · Method · 0.80

device · Method · 0.45

long · Method · 0.45

to · Method · 0.45

barrier · Method · 0.45

reset_trace · Method · 0.45

Tested by

no test coverage detected