I'm profiling a program that uses Pandas to process some CSVs. I'm using psutil's Process.memory_info to report the Virtual Memory Size (vms) and the Resident Set Size (rss) values, and Pandas' DataFrame.memory_usage (df.memory_usage().sum()) to report the size of my dataframes in memory.
There's a conflict between the reported vms and df.memory_usage values: Pandas reports more memory just for the dataframe than the Process.memory_info call reports for the whole (single-threaded) process.
For example:
- rss: 334671872 B
- vms: 663515136 B
- df.memory_usage().sum(): 670244208 B
The Process.memory_info call is made immediately after the memory_usage call. I expected df.memory_usage().sum() < vms to hold at all times, but it doesn't. I assume I'm misinterpreting the meaning of the vms value?
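For reference, the measurement looks roughly like this (a minimal sketch; data.csv stands in for my real input):
import os
import psutil
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical input file
df_bytes = df.memory_usage().sum()  # per-column byte counts, summed

mem = psutil.Process(os.getpid()).memory_info()  # taken immediately afterwards
print("rss:", mem.rss, "B")
print("vms:", mem.vms, "B")
print("df.memory_usage().sum():", df_bytes, "B")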
Total memory usage of a Pandas DataFrame with info(): to get the full memory usage, pass the memory_usage="deep" argument to info(). You get all the basic information about the dataframe, and at the end of the output a line such as "memory usage: 1.1 MB".
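A minimal sketch of that call (the DataFrame contents here are made up):
import pandas as pd

df = pd.DataFrame({"name": ["alice", "bob"], "score": [1.0, 2.0]})
# memory_usage="deep" makes pandas introspect object-dtype columns (e.g. strings)
# instead of only counting the 8-byte pointers that hold them
df.info(memory_usage="deep")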
Another option is memory_profiler: put its @profile decorator around any function or method and run python -m memory_profiler myscript.py. You'll see line-by-line memory usage once your script exits.
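A sketch of that workflow (myscript.py and the CSV path are placeholders):
# myscript.py -- run with: python -m memory_profiler myscript.py
import pandas as pd

@profile  # injected into builtins by memory_profiler when run via -m
def load():
    return pd.read_csv("data.csv")  # hypothetical input file

if __name__ == "__main__":
    load()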
A related reference for your problem is the question "use rss or vms to track memory". The relationship between RSS and VMS is a bit confusing.
**TO SUMMARIZE AND COMPLEMENT MY OPINION:**
RSS:
Resident Set Size shows how much of a process's memory is currently held in RAM. Remember: it does not include memory that has been swapped out. It does include memory from shared libraries (as long as those pages are resident), as well as all stack and heap memory.
VMS:
Virtual Memory Size includes all memory that the process can access: memory that is swapped out, memory that is allocated but not yet used, and memory from shared libraries.
Example:
Let's assume Process-X has a 500K binary, is linked against 2500K of shared libraries, and has 200K of stack/heap allocations of which 100K is actually in memory (the rest is swapped out or unused). So far it has actually loaded 1000K of the shared libraries and 400K of its own binary. Then:
RSS: 400K + 1000K + 100K = 1500K
VMS: 500K + 2500K + 200K = 3200K
Since part of this memory is shared, many processes may be using it, so if you add up all of the RSS values across processes you can easily end up with more space than your system has.
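To restate the example's arithmetic in code (the numbers are the hypothetical ones from above):
# sizes in K, taken from the Process-X example
binary_total, libs_total, stack_heap_total = 500, 2500, 200  # everything mapped
binary_res, libs_res, stack_heap_res = 400, 1000, 100        # actually resident

rss = binary_res + libs_res + stack_heap_res         # 1500K: only what is in RAM
vms = binary_total + libs_total + stack_heap_total   # 3200K: everything addressable
print(f"RSS = {rss}K, VMS = {vms}K")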
As you can see when you simply run this:
import os
import psutil

# inspect the current process
process = psutil.Process(os.getpid())
print("vms:", process.memory_info().vms)
print("rss:", process.memory_info().rss)
Output:
vms: 7217152
rss: 13975552
By simply adding import pandas as pd, you can see the difference:
import os
import psutil
import pandas as pd  # importing pandas maps its shared libraries into the process

process = psutil.Process(os.getpid())
print("vms:", process.memory_info().vms)
print("rss:", process.memory_info().rss)
Here is the output:
vms: 276295680
rss: 54116352
So memory that is allocated may not appear in RSS until the program actually uses (touches) it. If your program allocates a bunch of memory up front and then uses it over time, you could see RSS going up while VMS stays the same.
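A sketch that makes this effect visible (Linux behavior; exact numbers depend on the OS and allocator):
import os
import psutil
import numpy as np

process = psutil.Process(os.getpid())

print("before:    rss =", process.memory_info().rss)
buf = np.empty(200_000_000, dtype=np.uint8)  # ~200 MB reserved, pages not yet touched
print("allocated: rss =", process.memory_info().rss)  # VMS jumps, RSS barely moves
buf[:] = 1  # writing every page faults it into RAM
print("touched:   rss =", process.memory_info().rss)  # now RSS jumps too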
Now, whichever you go with, df.memory_usage().sum() or Process.memory_info, keep in mind that RSS does include memory from dynamically linked libraries. Since those pages are shared, summing the RSS of several processes counts them repeatedly, so the total will be more than the actual memory used.
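If you want a per-process figure that avoids that double counting, psutil also exposes the USS (Unique Set Size, the memory that would be freed if the process exited) via memory_full_info(); note it is slower to compute and may require elevated privileges on some platforms:
import os
import psutil

process = psutil.Process(os.getpid())
# uss counts only pages unique to this process, so shared library
# pages mapped by other processes are excluded
print("uss:", process.memory_full_info().uss)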