hub / github.com/deepspeedai/DeepSpeed / log_rank_file

Function log_rank_file

deepspeed/utils/debug.py:123–152

Print to a log file of the given rank. This is useful for debugging hangs in processes that are supposed to stay in sync: override print_rank_0 in the files under investigation so that each rank writes its messages to its own log file, then diff the per-rank logs to see where they diverge.

(rank, *msgs)
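The docstring suggests overriding print_rank_0 so that every rank logs to its own file. The following is a minimal sketch of that override, not DeepSpeed's own code: it uses a local copy of the function body so it runs without DeepSpeed, assumes a module-level `fh = None` (the listed body declares `global fh` but the initialization is outside the listed range), and reads the rank from a `RANK` environment variable as a stand-in for `deepspeed.comm.get_rank()`.

```python
import os
import tempfile

fh = None  # assumed module-level handle; the listed function only declares `global fh`


def log_rank_file(rank, *msgs):
    """Local copy of the listed function body, so the sketch runs without DeepSpeed."""
    global fh
    if fh is None:
        # Opened once, in write mode; later calls reuse the same handle.
        fh = open(f"log_rank_{rank}.txt", "w")
    for m in msgs:
        fh.write(f"{m}\n")
    fh.flush()


def print_rank_0(message, debug=False, force=False):
    # Stand-in for deepspeed.comm.get_rank(): assumes the launcher exports RANK.
    rank = int(os.environ.get("RANK", "0"))
    log_rank_file(rank, message)


os.chdir(tempfile.mkdtemp())  # keep the log file out of the current directory
print_rank_0("fetch param layer1.weight")      # hypothetical messages
print_rank_0("release param layer1.weight")

with open(f"log_rank_{os.environ.get('RANK', '0')}.txt") as f:
    logged = f.read()
print(logged)
```

Because the handle is opened only on the first call, the rank passed on that first call determines the file name for the lifetime of the process, and each message is flushed immediately, so the log survives a later hang.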

Source from the content-addressed store, hash-verified

def log_rank_file(rank, *msgs):
    """
    Print to a log file of the given rank

    This is useful for debugging hanging in sync processes. Here is a possible workflow:

    1. Enable the force debug in say partitioning and zero3 files
    2. Override the usual versions of print_rank_0 in those files with ::

        def print_rank_0(message, debug=False, force=False):
            rank = deepspeed.comm.get_rank()
            log_rank_file(rank, message)

    3. run the program
    4. fix up the expected differences, e.g. different cuda numbers ::

        perl -pi -e 's|cuda:1|cuda:0|' log_rank_*

    5. now diff and see where names and ids diverge - you will find where the gpus don't do the same
       work (e.g. when some layers get conditionally skipped on one gpu but not all)

        diff -u log_rank_0.txt log_rank_1.txt | less

    """
    global fh
    if fh is None:
        fh = open(f"log_rank_{rank}.txt", "w")
    for m in msgs:
        fh.write(f"{m}\n")
    fh.flush()
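Steps 4 and 5 of the docstring workflow (normalize the expected differences, then diff the per-rank logs) can also be approximated in Python. This is a rough sketch rather than part of DeepSpeed: the helper names `normalize_devices` and `diff_rank_logs` and the sample log lines are hypothetical, and the normalization rewrites every `cuda:<n>` to `cuda:0`, which is slightly broader than the `cuda:1` to `cuda:0` substitution in the perl one-liner.

```python
import difflib
import re
import tempfile
from pathlib import Path


def normalize_devices(text):
    """Rewrite every cuda:<n> to cuda:0 so device ids do not show up as differences."""
    return re.sub(r"cuda:\d+", "cuda:0", text)


def diff_rank_logs(path_a, path_b):
    """Return unified-diff lines between two per-rank logs after normalization."""
    a = normalize_devices(path_a.read_text()).splitlines()
    b = normalize_devices(path_b.read_text()).splitlines()
    return list(
        difflib.unified_diff(a, b, fromfile=path_a.name, tofile=path_b.name, lineterm="")
    )


# Hypothetical log contents: rank 1 skips layer2, which is the kind of
# divergence the workflow is meant to expose.
workdir = Path(tempfile.mkdtemp())
(workdir / "log_rank_0.txt").write_text(
    "fetch layer1 cuda:0\nfetch layer2 cuda:0\nfetch layer3 cuda:0\n"
)
(workdir / "log_rank_1.txt").write_text(
    "fetch layer1 cuda:1\nfetch layer3 cuda:1\n"
)

diff_lines = diff_rank_logs(workdir / "log_rank_0.txt", workdir / "log_rank_1.txt")
print("\n".join(diff_lines))
```

After normalization the device ids no longer differ, so the only removed line in the diff is the layer2 fetch that rank 1 never logged.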

Callers

nothing calls this directly

Calls 2

write · Method · 0.45

flush · Method · 0.45

Tested by

no test coverage detected
