When running a PyTorch training program with num_workers=32 for the DataLoader, htop shows 33 python processes, each with 32 GB of VIRT and 15 GB of RES.
Does this mean that the PyTorch training is using 33 processes × 15 GB = 495 GB of memory? htop shows only about 50 GB of RAM and 20 GB of swap in use on the entire machine, which has 128 GB of RAM. So, how do we explain the discrepancy?
Is there a more accurate way of calculating the total amount of RAM being used by the main PyTorch program and all its child DataLoader worker processes?
Thank you
Memory management. PyTorch uses a caching memory allocator to speed up memory allocations. This allows fast memory deallocation without device synchronizations. However, the unused memory managed by the allocator will still show as if used in nvidia-smi.
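A minimal sketch of how this shows up in practice (it assumes a CUDA-capable machine); the allocated/reserved split below is why nvidia-smi can report more memory than your tensors actually occupy:

```python
import torch

x = torch.randn(1024, 1024, device="cuda")  # allocates ~4 MB of GPU memory
print(torch.cuda.memory_allocated())        # bytes actually in use by tensors
print(torch.cuda.memory_reserved())         # bytes held by the caching allocator

del x                                       # tensor freed...
print(torch.cuda.memory_allocated())        # ...so allocated drops back toward 0
print(torch.cuda.memory_reserved())         # but reserved stays cached for reuse
torch.cuda.empty_cache()                    # release cached blocks back to the driver
```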
Data loader. Combines a dataset and a sampler, and provides an iterable over the given dataset. The DataLoader supports both map-style and iterable-style datasets with single- or multi-process loading, customizing loading order and optional automatic batching (collation) and memory pinning.
num_workers tells the DataLoader instance how many subprocesses to use for data loading. If num_workers is zero (the default), the data is loaded in the main process and the GPU has to wait for the CPU to load it. In theory, the greater num_workers is, the more efficiently the CPU loads data and the less the GPU has to wait.
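For example, a minimal sketch (the dataset shape, batch size, and worker count below are made up for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(
    torch.randn(10_000, 3, 32, 32),   # fake images
    torch.randint(0, 10, (10_000,)),  # fake labels
)
loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,  # 4 subprocesses prefetch batches; 0 = load in the main process
)

if __name__ == "__main__":  # guard required on platforms that spawn workers (Windows/macOS)
    for images, labels in loader:
        ...  # training step goes here
```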
In CUDA terms, pinned memory does not mean GPU memory but non-paged (page-locked) CPU memory. The benefits and rationale are provided here, but the gist of it is that this flag allows the x.cuda() operation (which you still have to execute as usual) to avoid one implicit CPU-to-CPU copy from pageable to pinned memory, which makes it a bit more performant.
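A sketch of how the flag is typically used together with a non-blocking transfer (it assumes a CUDA device is available):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(10_000, 128))
loader = DataLoader(dataset, batch_size=64, pin_memory=True)  # batches land in page-locked RAM

for (batch,) in loader:
    # The source tensor is already pinned, so this host-to-device copy skips
    # the implicit pageable-to-pinned staging copy and can overlap with
    # computation when non_blocking=True.
    batch = batch.cuda(non_blocking=True)
```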
Does this mean that the PyTorch training is using 33 processes X 15 GB = 495 GB of memory?
Not necessarily. VIRT is address space the process has reserved, not RAM it actually uses, and because the DataLoader workers are forked from the main process, much of each worker's RES consists of copy-on-write pages shared with the parent, so summing RES across all 33 processes counts the same physical memory many times over. You have a main process with several worker subprocesses, and the CPU has several cores. One worker usually loads one batch, so the next batch can already be loaded and ready to go by the time the main process asks for another. This prefetching is the secret of the speed-up.
I would guess you should use far fewer num_workers. It would also be interesting to know your batch size, which you can tune for the training process as well.
Is there a more accurate way of calculating the total amount of RAM being used by the main PyTorch program and all its child DataLoader worker processes?
I googled but could not find a concrete formula; I think it is only a rough estimate based on how many cores your CPU has, how much memory is available, and your batch size. Choosing num_workers depends on what kind of machine you are using, what kind of dataset you are working with, and how much on-the-fly pre-processing your data requires.
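That said, for measuring (rather than estimating) the RAM actually in use, summing RES overcounts shared pages. A sketch of a fairer total using the psutil package, assuming Linux (the PID below is hypothetical): PSS (proportional set size) divides each shared page among the processes sharing it, so memory the forked workers inherit from the main process is not counted 33 times.

```python
import psutil

main = psutil.Process(12345)  # hypothetical PID of the main training process
procs = [main] + main.children(recursive=True)

# memory_full_info() reads /proc/<pid>/smaps, which exposes PSS on Linux.
total_pss = sum(p.memory_full_info().pss for p in procs)
print(f"{len(procs)} processes, ~{total_pss / 2**30:.1f} GiB of RAM in use")
```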
HTH