Understanding Memory Usage by PyTorch DataLoader Workers

When running a PyTorch training program with num_workers=32 for the DataLoader, htop shows 33 Python processes, each with 32 GB of VIRT and 15 GB of RES.

Does this mean that the PyTorch training is using 33 processes x 15 GB = 495 GB of memory? htop shows only about 50 GB of RAM and 20 GB of swap in use on the entire machine, which has 128 GB of RAM. So how do we explain the discrepancy?

Is there a more accurate way of calculating the total amount of RAM being used by the main PyTorch program and all its child DataLoader worker processes?

Thank you

Asked by Athena Wisdom on Aug 21 '20


People also ask

How does PyTorch allocate memory?

Memory management. PyTorch uses a caching memory allocator to speed up memory allocations. This allows fast memory deallocation without device synchronizations. However, the unused memory managed by the allocator will still show as if used in nvidia-smi.
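
As a small sketch (the tensor size below is purely illustrative), the allocator's counters can be inspected from Python: torch.cuda.memory_allocated() reports memory actually held by live tensors, while torch.cuda.memory_reserved() reports what the caching allocator has claimed from the driver, which is closer to the figure nvidia-smi shows.

```python
import torch

if torch.cuda.is_available():
    x = torch.randn(4096, 4096, device="cuda")  # ~64 MiB of float32
    print(torch.cuda.memory_allocated() / 2**20, "MiB held by tensors")
    print(torch.cuda.memory_reserved() / 2**20, "MiB reserved by the caching allocator")

    del x
    # The freed block stays cached; empty_cache() returns it to the driver,
    # which is when nvidia-smi sees the memory drop.
    torch.cuda.empty_cache()
    print(torch.cuda.memory_reserved() / 2**20, "MiB reserved after empty_cache()")
```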

What does DataLoader do in PyTorch?

Data loader. Combines a dataset and a sampler, and provides an iterable over the given dataset. The DataLoader supports both map-style and iterable-style datasets with single- or multi-process loading, customizing loading order and optional automatic batching (collation) and memory pinning.

What is worker in DataLoader?

num_workers tells the DataLoader instance how many subprocesses to use for data loading. If num_workers is zero (the default), the GPU has to wait for the CPU to load data. Theoretically, the greater num_workers is, the more efficiently the CPU loads data and the less the GPU has to wait.
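
For illustration, a minimal sketch with a synthetic in-memory dataset (the sizes, batch size, and worker count below are placeholders, not values from the question):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    # 10,000 samples of 64 features each, with integer class labels 0-9
    dataset = TensorDataset(torch.randn(10_000, 64),
                            torch.randint(0, 10, (10_000,)))
    loader = DataLoader(
        dataset,
        batch_size=128,
        shuffle=True,
        num_workers=4,   # 4 worker processes load and collate batches in parallel
    )
    for features, labels in loader:
        pass             # the training step would go here

if __name__ == "__main__":  # guard needed when workers are started via spawn
    main()
```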

What is pin memory in DataLoader?

In CUDA terms, pinned memory does not mean GPU memory but non-paged CPU memory. The benefits and rationale are provided here, but the gist of it is that this flag allows the x.cuda() operation (which you still have to execute as usual) to avoid one implicit CPU-to-CPU copy, which makes it a bit more performant.
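
A hedged sketch of how pin_memory is typically combined with non-blocking host-to-GPU copies (the dataset and sizes are again placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    dataset = TensorDataset(torch.randn(10_000, 64),
                            torch.randint(0, 10, (10_000,)))
    # pin_memory=True makes the loader collate batches into page-locked host
    # memory, so the copy to the GPU can overlap with other work.
    loader = DataLoader(dataset, batch_size=128, num_workers=4, pin_memory=True)

    device = torch.device("cuda")
    for features, labels in loader:
        features = features.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)
        # forward/backward pass would go here

if __name__ == "__main__":
    main()
```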



1 Answer

Does this mean that the PyTorch training is using 33 processes X 15 GB = 495 GB of memory?

Not necessarily. The main training process spawns the worker subprocesses, and your CPU has several cores to run them on. One worker usually loads one batch, so the next batch can already be loaded and ready to go by the time the main process is done with the current one. This is the secret behind the speed-up. Also, the workers are forked from the main process, so their RES figures include pages shared with the parent; simply adding up RES across all 33 processes counts that shared memory many times over, which is why the machine-wide usage is far below 495 GB.

I guess you should use far fewer num_workers.

It would also be interesting to know your batch size, which you can tune for the training process as well.
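
If it helps, here is a rough, hypothetical way to compare settings on your own data: time one pass over the loader for a few values of num_workers (the synthetic dataset below is just a stand-in):

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    dataset = TensorDataset(torch.randn(50_000, 64),
                            torch.randint(0, 10, (50_000,)))
    for workers in (0, 2, 4, 8, 16, 32):
        loader = DataLoader(dataset, batch_size=128, num_workers=workers)
        start = time.time()
        for _ in loader:   # iterate once without training, just to time loading
            pass
        print(f"num_workers={workers:2d}: {time.time() - start:.2f} s")

if __name__ == "__main__":
    main()
```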

Is there a more accurate way of calculating the total amount of RAM being used by the main PyTorch program and all its child DataLoader worker processes?

I was googling but could not find a concrete formula. I think you can only make a rough estimate based on how many cores your CPU has, how much memory is available, and your batch size.
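
One option (assuming Linux and the third-party psutil package, which is not part of PyTorch) is to sum USS or PSS over the main process and its children instead of RES. PSS splits pages shared after fork() proportionally between processes, so it does not inflate the total the way summing RES does:

```python
import os
import psutil  # pip install psutil

def report_memory(pid=None):
    parent = psutil.Process(pid or os.getpid())
    procs = [parent] + parent.children(recursive=True)
    rss = uss = pss = 0
    for p in procs:
        try:
            info = p.memory_full_info()  # reads /proc/<pid>/smaps on Linux
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
        rss += info.rss                  # double-counts pages shared between processes
        uss += info.uss                  # memory unique to each process
        pss += getattr(info, "pss", 0)   # shared pages split proportionally (Linux only)
    gib = 1024 ** 3
    print(f"{len(procs)} processes: RSS {rss / gib:.1f} GiB, "
          f"USS {uss / gib:.1f} GiB, PSS {pss / gib:.1f} GiB")

# Call report_memory() from inside the training script, e.g. once per epoch,
# while the DataLoader workers are alive.
```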

Choosing num_workers depends on what kind of machine you are using, what kind of dataset you are working with, and how much on-the-fly pre-processing your data requires.

HTH

Answered by j35t3r on Oct 10 '22