def log_rank_file(rank, *msgs):
    """
    Print to a log file of the given rank

    This is useful for debugging hanging in sync processes. Here is a possible workflow:

    1. Enable the force debug in say partitioning and zero3 files
    2. Override the usual versions of print_rank_0 in those files with ::

        def print_rank_0(message, debug=False, force=False):
            rank = deepspeed.comm.get_rank()
            log_rank_file(rank, message)

    3. run the program
    4. fix up the expected differences, e.g. different cuda numbers ::

        perl -pi -e 's|cuda:1|cuda:0|' log_rank_*

    5. now diff and see where names and ids diverge - you will find where the gpus don't do the same
       work (e.g. when some layers get conditionally skipped on one gpu but not all) ::

        diff -u log_rank_0.txt log_rank_1.txt | less

    """
    # fh is a module-level file handle; it is opened lazily on the first call.
    global fh
    if fh is None:
        fh = open(f"log_rank_{rank}.txt", "w")
    for m in msgs:
        fh.write(f"{m}\n")
    # Flush after every call so the log is complete even if the process hangs.
    fh.flush()


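# A minimal sketch of steps 4 and 5 of the workflow in the log_rank_file
# docstring, done in Python instead of perl and diff. Everything below is
# illustrative and not part of DeepSpeed: the helper names
# normalize_rank_lines and diff_rank_logs are hypothetical, and the sketch
# assumes the only expected difference between ranks is the "cuda:N" device
# string. Unlike the perl one-liner, it rewrites every device index.
import difflib
import re


def normalize_rank_lines(lines):
    # Rewrite every "cuda:<N>" to "cuda:0" so device ids do not show up as diffs.
    return [re.sub(r"cuda:\d+", "cuda:0", line) for line in lines]


def diff_rank_logs(path_a="log_rank_0.txt", path_b="log_rank_1.txt"):
    # Return unified-diff lines showing where the two rank logs diverge.
    # An empty list means both ranks logged the same sequence of messages.
    with open(path_a) as fa, open(path_b) as fb:
        lines_a = normalize_rank_lines(fa.readlines())
        lines_b = normalize_rank_lines(fb.readlines())
    return list(difflib.unified_diff(lines_a, lines_b, fromfile=path_a, tofile=path_b))

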
def print_backward_tensors(tensor):