hub / github.com/stas00/ml-engineering / see_memory_usage

Function see_memory_usage

debug/code/see-mem-usage.py:61–117


(message, force=False, ranks=[0])

Source

# Imports used by this excerpt; in the original file they sit above line 61.
import gc

import psutil
import torch
import torch.distributed as dist

# get_nvml_mem() is defined elsewhere in the same file.


def see_memory_usage(message, force=False, ranks=[0]):
    """
    Arguments:
    - `message`: a preamble message to print before the counter dumps - useful for annotating where each measurement has been taken - e.g. "before foo" and later "after foo"
    - `force`: allows you to leave see_memory_usage calls in the code without running them; set `force=True` to activate
    - `ranks`: by default prints only on rank 0, but if you need to debug other ranks, pass the list of desired ranks, e.g., `ranks=[1,3]`

    Make sure `pip install nvidia-ml-py` has been run, so that the report includes not only the CUDA memory report but also the total GPU memory usage, since the CUDA memory allocator is not always used. E.g. NCCL memory allocations aren't visible to CUDA and thus aren't reported, but can consume GBs of GPU memory.

    Pattern of usage:

    see_memory_usage("before fwd", force=True)
    output = model(**inputs)
    see_memory_usage("before bwd", force=True)
    output.loss.backward()
    see_memory_usage("before step", force=True)
    optimizer.step()
    see_memory_usage("after step", force=True)

    """
    if not force:
        return
    rank = dist.get_rank() if dist.is_initialized() else 0
    if rank not in ranks:
        return

    # python doesn't do real-time garbage collection so do it explicitly to get the correct RAM reports
    gc.collect()

    # In some situations we want to flush the cache but not in others, so for now let the developer
    # override this manually - by default it should not be called. When it's not enabled use the
    # MA_* numbers to get the real memory usage, rather than the CA_* ones
    # torch.cuda.empty_cache()

    # collect raw memory usage outside pytorch
    nv_mem = get_nvml_mem()

    vm_stats = psutil.virtual_memory()
    used_GB = round(((vm_stats.total - vm_stats.available) / (1024**3)), 2)

    accelerator_mem_str = " | ".join(
        [
            f"MA {round(torch.cuda.memory_allocated() / 2**30, 2):0.2f} GB",
            f"Max_MA {round(torch.cuda.max_memory_allocated() / 2**30, 2):0.2f} GB",
            f"CA {round(torch.cuda.memory_reserved() / 2**30, 2):0.2f} GB",
            f"Max_CA {round(torch.cuda.max_memory_reserved() / 2**30, 2):0.2f} GB",
            f"NV {round(nv_mem / 2**30, 2):0.2f} GB",
        ]
    )
    cpu_mem_str = f"CPU Virtual Memory: used = {used_GB} GB, percent = {vm_stats.percent}%"

    # add '[rank] mp' prefix to enable easy grep
    print(f"[{rank}] mp: {message}")
    print(f"[{rank}] mp: " + " | ".join([accelerator_mem_str, cpu_mem_str]))

    # the Max_* values are peaks since the previous call, so reset the counters for the next call
    torch.cuda.reset_peak_memory_stats()
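Because every report line carries the `[rank] mp:` prefix, the output can be grepped and then post-processed. The following is a minimal sketch of such post-processing, assuming the line layout produced by the two `print` calls in the function. The helper name `parse_mem_line` and the sample numbers in the usage note are hypothetical and not part of the original file.

```python
import re

# Matches the "[rank] mp: " prefix that see_memory_usage puts on every line.
PREFIX_RE = re.compile(r"^\[(\d+)\] mp: ")
# Matches "<counter> <number> GB" fields such as "MA 1.50 GB" or "Max_CA 2.25 GB".
FIELD_RE = re.compile(r"\b(MA|Max_MA|CA|Max_CA|NV) (\d+(?:\.\d+)?) GB")


def parse_mem_line(line):
    """Return (rank, {counter: GB}) for a counter line, or None for any other line."""
    prefix = PREFIX_RE.match(line)
    if prefix is None:
        return None  # not a see_memory_usage line
    fields = {name: float(value) for name, value in FIELD_RE.findall(line)}
    if not fields:
        return None  # the "[rank] mp: <message>" annotation line
    return int(prefix.group(1)), fields
```

For example, feeding it a made-up line such as `[3] mp: MA 1.50 GB | Max_MA 2.00 GB | CA 2.25 GB | Max_CA 2.25 GB | NV 3.10 GB | CPU Virtual Memory: used = 10.5 GB, percent = 12.3%` yields rank `3` and a dict of the five accelerator counters. The CPU part is skipped on purpose, since its format differs from the counter fields.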

Callers 1

see-mem-usage.py · File · 0.70

Calls 2

get_nvml_mem · Function · 0.85

print · Function · 0.85
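The body of `get_nvml_mem` is not shown on this page. As an illustration only, here is a minimal sketch of how such a helper could read the used device memory through NVML, assuming the `pynvml` module that `pip install nvidia-ml-py` provides. The name `get_nvml_mem_sketch`, the `device_index` parameter, and the fall-back-to-zero behavior are assumptions, not the original implementation.

```python
def get_nvml_mem_sketch(device_index=0):
    """Return used device memory in bytes as seen by NVML, or 0 if NVML is unavailable."""
    try:
        import pynvml  # provided by `pip install nvidia-ml-py`

        pynvml.nvmlInit()
    except Exception:
        # Assumption: report 0 when the package or the NVIDIA driver is missing,
        # so the caller can still print the remaining counters.
        return 0
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        # .used is device-wide, so it also counts allocations made outside PyTorch.
        return int(pynvml.nvmlDeviceGetMemoryInfo(handle).used)
    except pynvml.NVMLError:
        return 0
    finally:
        pynvml.nvmlShutdown()
```

Initializing and shutting down NVML on every call keeps the sketch self-contained. A helper that is called often would more likely initialize once and cache the device handle.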

Tested by

no test coverage detected