def see_memory_usage(message, force=False, ranks=[0]):
    """
    Arguments:
    - `message`: a preamble message to print before the counter dump. Useful for annotating where each measurement was taken, e.g. "before foo" and later "after foo".
    - `force`: the call is a no-op unless `force=True`. This lets you leave `see_memory_usage` calls in the code without running them; set `force=True` to activate.
    - `ranks`: by default the report is printed only on rank 0. To debug other ranks, pass the list of desired ranks, e.g. `ranks=[1,3]`.

    Make sure to run `pip install nvidia-ml-py` so that the report includes not only the CUDA memory counters but also the total GPU memory usage, since the CUDA memory allocator is not always used. For example, NCCL memory allocations aren't visible to CUDA and thus aren't reported, but they can consume GBs of GPU memory.

    Pattern of usage:

        see_memory_usage("before fwd", force=True)
        output = model(**inputs)
        see_memory_usage("before bwd", force=True)
        output.loss.backward()
        see_memory_usage("before step", force=True)
        optimizer.step()
        see_memory_usage("after step", force=True)

    """
    if not force:
        return
    rank = dist.get_rank() if dist.is_initialized() else 0
    if rank not in ranks:
        return

    # Python doesn't do real-time garbage collection, so collect explicitly to get correct RAM reports
    gc.collect()

    # In some situations we want to flush the cache but not in others, so for now let the developer
    # override this manually. By default it should not be called. When it's not enabled, use the
    # MA/Max_MA numbers to get the real memory usage, rather than the CA/Max_CA ones
    # torch.cuda.empty_cache()

    # collect raw memory usage outside PyTorch
    nv_mem = get_nvml_mem()

    vm_stats = psutil.virtual_memory()
    used_GB = round(((vm_stats.total - vm_stats.available) / (1024**3)), 2)

    # all values are divided by 2**30, i.e. they are GiB even though the label says GB
    accelerator_mem_str = " | ".join(
        [
            f"MA {round(torch.cuda.memory_allocated() / 2**30, 2):0.2f} GB",
            f"Max_MA {round(torch.cuda.max_memory_allocated() / 2**30, 2):0.2f} GB",
            f"CA {round(torch.cuda.memory_reserved() / 2**30, 2):0.2f} GB",
            f"Max_CA {round(torch.cuda.max_memory_reserved() / 2**30, 2):0.2f} GB",
            f"NV {round(nv_mem / 2**30, 2):0.2f} GB",
        ]
    )
    cpu_mem_str = f"CPU Virtual Memory: used = {used_GB} GB, percent = {vm_stats.percent}%"

    # add '[rank] mp' prefix to enable easy grep
    print(f"[{rank}] mp: {message}")
    print(f"[{rank}] mp: " + " | ".join([accelerator_mem_str, cpu_mem_str]))

    # reset the peak counters so that the next call reports the peak reached since this call
    torch.cuda.reset_peak_memory_stats()
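The function above relies on a `get_nvml_mem()` helper that is not part of this excerpt. As a rough idea of what such a helper might do, here is a minimal sketch that reads the driver-level "used" memory of one GPU through the `pynvml` module (installed by `nvidia-ml-py`). The name `nvml_used_bytes`, the `device_index` parameter and the fall-back to `0` are assumptions made for illustration, not the actual implementation.

```python
def nvml_used_bytes(device_index=0):
    """Return the bytes in use on one GPU as seen by the driver, or 0 if NVML is unavailable.

    Hypothetical stand-in for `get_nvml_mem()`; the real helper may differ.
    """
    try:
        import pynvml  # module provided by the nvidia-ml-py package
    except ImportError:
        return 0  # package not installed: report nothing rather than fail

    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return 0  # no NVIDIA driver or NVML library on this machine

    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        # .used covers all allocations on the device, not only those made by PyTorch
        return int(pynvml.nvmlDeviceGetMemoryInfo(handle).used)
    except pynvml.NVMLError:
        return 0  # e.g. invalid device index
    finally:
        pynvml.nvmlShutdown()


print(f"NV {nvml_used_bytes() / 2**30:0.2f} GB")
```

Note that NVML enumerates all GPUs on the node regardless of `CUDA_VISIBLE_DEVICES`, so `device_index` here does not necessarily match the PyTorch device index.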
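Because every report line starts with the `[rank] mp: ` prefix, the output can be grepped from a training log and post-processed. The sketch below shows one possible way to turn a counters line into numbers using only the standard library. The helper name `parse_report_line` and the sample line are made up for illustration; the field names and layout are taken from the `print` calls above.

```python
import re

# "[<rank>] mp: " prefix printed at the start of every report line
PREFIX_RE = re.compile(r"^\[(\d+)\] mp: ")
# accelerator fields such as "MA 1.50 GB" or "Max_CA 4.00 GB"
FIELD_RE = re.compile(r"\b(Max_MA|Max_CA|MA|CA|NV) (\d+\.\d+) GB")


def parse_report_line(line):
    """Return {'rank': int, 'MA': float, ...} for a counters line, else None."""
    prefix = PREFIX_RE.match(line)
    if prefix is None:
        return None  # not a see_memory_usage line
    fields = {name: float(value) for name, value in FIELD_RE.findall(line)}
    if not fields:
        return None  # the message line, e.g. "[0] mp: before fwd"
    fields["rank"] = int(prefix.group(1))
    return fields


# Hypothetical sample line in the format produced above
sample = (
    "[0] mp: MA 1.50 GB | Max_MA 2.00 GB | CA 3.00 GB | Max_CA 4.00 GB | NV 5.25 GB"
    " | CPU Virtual Memory: used = 10.5 GB, percent = 8.3%"
)
report = parse_report_line(sample)
print(report["rank"], report["MA"], report["NV"])
```

Comparing `NV` against `CA` across successive reports is one way to spot memory that was allocated outside the CUDA caching allocator.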