Using psutil.Process.memory_info memory usage differs from Pandas.memory_usage

I'm profiling a program that makes use of Pandas to process some CSVs. I'm using psutil's Process.memory_info to report the Virtual Memory Size (vms) and the Resident Set Size (rss) values. I'm also using Pandas DataFrame.memory_usage (df.memory_usage().sum()) to report the size of my dataframes in memory.

There's a conflict between the reported vms and df.memory_usage values, where Pandas is reporting more memory just for the dataframe than the Process.memory_info call is reporting for the whole (single-threaded) process.

For example:

  • rss: 334671872 B
  • vms: 663515136 B
  • df.memory_usage().sum(): 670244208 B

The Process.memory_info call is made immediately after the memory_usage call. My expected result was that df.memory_usage < vms at all times, but this doesn't hold up. I assume I'm misinterpreting the meaning of the vms value?

asked Oct 14 '19 15:10 by musingsole



1 Answer

Here is a reference related to your problem: whether to use rss or vms to track memory. The relationship between RSS and VMS is a bit confusing, so it is worth learning about these concepts in detail. You should also know how to calculate total memory usage.

**TO SUMMARIZE AND COMPLEMENT MY OPINION**:


RSS:

Resident set size shows how much of the memory allocated to a process is in RAM. Remember that it doesn't include memory that is swapped out.

It includes memory from shared libraries, as long as those pages are actually in memory, as well as the resident part of stack and heap memory.

VMS:

Virtual memory size includes all memory that the process can access. This includes memory that is swapped out, memory that is allocated but not used, and memory that comes from shared libraries.

Example:

Let's assume a process, Process-X, has a 500K binary and is linked to 2500K of shared libraries. It has 200K of stack/heap allocations, of which 100K is actually in memory (the rest is swapped out or unused), and it has actually loaded only 1000K of the shared libraries and 400K of its own binary. Then:

RSS: 400K + 1000K + 100K = 1500K
VMS: 500K + 2500K + 200K = 3200K

In this example, part of the memory is shared, so many processes may use the same pages. If you add up the RSS values of all processes, you can easily end up with a total larger than the memory your system actually has.
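The arithmetic of the Process-X example can be written out as a short sketch. The first part only restates the numbers above. The second part is a hypothetical extension, assuming three identical processes where only the shared-library pages are shared between them, to illustrate why summing RSS over-counts.

```python
# Sizes in K, copied from the Process-X example.
binary_total, binary_resident = 500, 400
libs_total, libs_resident = 2500, 1000
stack_heap_total, stack_heap_resident = 200, 100

# RSS counts only what is resident in RAM; VMS counts everything mapped.
rss = binary_resident + libs_resident + stack_heap_resident
vms = binary_total + libs_total + stack_heap_total
print(f"RSS: {rss}K, VMS: {vms}K")

# Hypothetical extension (not from the example): three such processes
# that share the same resident library pages.
n_processes = 3
sum_of_rss = n_processes * rss
physical = libs_resident + n_processes * (binary_resident + stack_heap_resident)
print(f"sum of RSS: {sum_of_rss}K, physical memory occupied: {physical}K")
```

Under these assumptions the summed RSS counts the 1000K of library pages three times, although they occupy RAM only once.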

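You can see this when you simply run the following:

import os
import psutil

# Inspect the current Python process
process = psutil.Process(os.getpid())
# Take one snapshot so that both values come from the same moment
mem = process.memory_info()
print("vms:", mem.vms)
print("rss:", mem.rss)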

Output:

vms: 7217152

rss: 13975552

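By simply adding import pandas as pd, you can see the difference:

import os
import psutil
import pandas as pd  # imported only to show its effect on memory

# Inspect the current Python process
process = psutil.Process(os.getpid())
# Take one snapshot so that both values come from the same moment
mem = process.memory_info()
print("vms:", mem.vms)
print("rss:", mem.rss)

Here is the output: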

vms: 276295680

rss: 54116352

Memory that has been allocated may not show up in RSS until the program actually uses it. So if your program allocates a lot of memory up front and then uses it over time:

  • You could see RSS going up while VMS stays the same.
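The allocate-first, use-later behaviour described above can be observed with a minimal sketch. It is an illustration under stated assumptions, not part of the original answer: it assumes Linux, reads /proc/self/statm directly instead of going through psutil so that it has no third-party dependency, and uses an arbitrary 64 MiB anonymous mapping. The helper name read_vms_rss is hypothetical.

```python
import mmap
import os

PAGE = os.sysconf("SC_PAGE_SIZE")


def read_vms_rss():
    """Return (vms, rss) in bytes for this process (Linux only)."""
    with open("/proc/self/statm") as f:
        # The first two fields are total program size and resident size, in pages.
        size_pages, resident_pages = f.read().split()[:2]
    return int(size_pages) * PAGE, int(resident_pages) * PAGE


SIZE = 64 * 1024 * 1024  # 64 MiB, an arbitrary size chosen for illustration

vms_start, rss_start = read_vms_rss()

# Reserve anonymous memory: the address space grows, but no page is touched yet.
buf = mmap.mmap(-1, SIZE)
vms_mapped, rss_mapped = read_vms_rss()

# Write one byte per page so the kernel has to back every page with real RAM.
for offset in range(0, SIZE, PAGE):
    buf[offset] = 1
vms_touched, rss_touched = read_vms_rss()

print("after mapping:  vms +", vms_mapped - vms_start, " rss +", rss_mapped - rss_start)
print("after touching: vms +", vms_touched - vms_mapped, " rss +", rss_touched - rss_mapped)
buf.close()
```

On a typical Linux system the mapping step should raise VMS by about the mapped size while RSS barely moves, and the touching step should raise RSS while VMS stays roughly flat. Exact numbers vary between runs and systems.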

Whether you go with df.memory_usage().sum() or Process.memory_info, keep in mind that RSS, I believe, includes memory from dynamically linked libraries. Because those pages are shared between processes, the sum of the RSS values of several processes will be more than the memory actually used.

answered Sep 21 '22 17:09 by Muhammad Usman Bashir