hub / github.com/deepspeedai/DeepSpeed / log_rank_file

Function log_rank_file

deepspeed/utils/debug.py:123–152

Print to a log file of the given rank. This is useful for debugging hangs in processes that should stay in sync: each process writes its messages to its own log_rank_<rank>.txt file, and diffing those files shows where the ranks diverge. The docstring in the source outlines a possible workflow.

(rank, *msgs)
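
A minimal usage sketch, assuming a single process and no DeepSpeed install. The function here is a stand-in that mirrors the source listed on this page, plus a module-level `fh = None` that the `global fh` statement relies on (an assumption, since that line falls outside the range shown). `get_rank_from_env` is a hypothetical helper that reads the `RANK` environment variable in place of `deepspeed.comm.get_rank()`.

```python
import os

# Assumption: the real module defines this handle at module scope,
# outside the lines shown on this page.
fh = None


def log_rank_file(rank, *msgs):
    """Stand-in mirroring the listed source: one line per message, then flush."""
    global fh
    if fh is None:
        fh = open(f"log_rank_{rank}.txt", "w")
    for m in msgs:
        fh.write(f"{m}\n")
    fh.flush()


def get_rank_from_env():
    """Hypothetical helper standing in for deepspeed.comm.get_rank()."""
    return int(os.environ.get("RANK", "0"))


def print_rank_0(message, debug=False, force=False):
    """Override from step 2 of the docstring: log on every rank, not only rank 0."""
    log_rank_file(get_rank_from_env(), message)


print_rank_0("fetching param id=3")
log_rank_file(get_rank_from_env(), "first message", "second message")
```

Because the handle is opened only once per process, the rank passed on the first call decides the file name for every later call in that process.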

Source from the content-addressed store, hash-verified

def log_rank_file(rank, *msgs):
    """
    Print to a log file of the given rank

    This is useful for debugging hanging in sync processes. Here is a possible workflow:

    1. Enable the force debug in say partitioning and zero3 files
    2. Override the usual versions of print_rank_0 in those files with ::

        def print_rank_0(message, debug=False, force=False):
            rank = deepspeed.comm.get_rank()
            log_rank_file(rank, message)

    3. run the program
    4. fix up the expected differences, e.g. different cuda numbers ::

        perl -pi -e 's|cuda:1|cuda:0|' log_rank_*

    5. now diff and see where names and ids diverge - you will find where the gpus don't do the same
       work (e.g. when some layers get conditionally skipped on one gpu but not all)

        diff -u log_rank_0.txt log_rank_1.txt | less

    """
    global fh
    if fh is None:
        fh = open(f"log_rank_{rank}.txt", "w")
    for m in msgs:
        fh.write(f"{m}\n")
    fh.flush()
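
Steps 4 and 5 of the docstring can also be done in Python instead of with perl and diff. The following is a minimal sketch using only the standard library. The helper names `normalize_devices` and `diff_rank_logs` are hypothetical, the sample log lines are made up for illustration, and the regex rewrites every `cuda:<N>` on a line, which is broader than the single `cuda:1` substitution in the perl one-liner.

```python
import difflib
import re


def normalize_devices(lines, target="cuda:0"):
    """Rewrite every cuda:<N> to one device name so expected differences vanish."""
    return [re.sub(r"cuda:\d+", target, line) for line in lines]


def diff_rank_logs(lines_a, lines_b, name_a="log_rank_0.txt", name_b="log_rank_1.txt"):
    """Return a unified diff of two normalized rank logs as a list of lines."""
    return list(
        difflib.unified_diff(
            normalize_devices(lines_a),
            normalize_devices(lines_b),
            fromfile=name_a,
            tofile=name_b,
            lineterm="",
        )
    )


# Made-up sample logs for illustration, not real DeepSpeed output.
rank0 = ["fetch layer1 on cuda:0", "fetch layer2 on cuda:0", "fetch layer3 on cuda:0"]
rank1 = ["fetch layer1 on cuda:1", "fetch layer3 on cuda:1"]

delta = diff_rank_logs(rank0, rank1)
print("\n".join(delta))
```

Lines that survive normalization and still show up in the diff mark the point where one rank did work the other skipped. To run this on real logs, read each file with `open(path).read().splitlines()` and pass the two lists in.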

Callers

nothing calls this directly

Calls 2

write · Method · 0.45
flush · Method · 0.45

Tested by

no test coverage detected
