Pandas memory usage inconsistencies

I have some memory inconsistencies when I am using a Pandas DataFrame.

Here is my code skeleton:

import pandas as pd
import numpy as np

columns_dtype = {'A': np.int16, 'B': np.int8, ...}
df = pd.read_csv('my_file.csv', dtype=columns_dtype)

This basically just reads a CSV file with pandas while controlling the column data types. But when I look at how much memory is allocated to my program, the numbers do not seem consistent.

Info 1:

df.info(memory_usage='deep')

That gives: memory usage: 482.6 MB
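
The same total can be broken down per column to see where the bytes go (a small sketch using the df defined above):

  # per-column memory in bytes; deep=True also counts the contents
  # of object (e.g. string) columns, not just the pointer arrays
  df.memory_usage(deep=True)

  # the sum matches the total reported by df.info(memory_usage='deep')
  df.memory_usage(deep=True).sum()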

Info 2:

import dill, sys
sys.getsizeof(dill.dumps(df))

That gives: 506049782 bytes (so about 506 MB)

Info 3:

The RAM allocated to my program is: 1.1 GiB (about 1.2 GB)
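
One way such a figure can be obtained from inside the process is sketched below; using psutil here is an assumption, the value could equally come from top or a task manager:

  import os
  import psutil

  # resident set size (RSS) of the current Python process, in GiB
  rss_bytes = psutil.Process(os.getpid()).memory_info().rss
  print(rss_bytes / 1024**3, "GiB")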

Additional info (but I do not think it is relevant):

The size of my_file.csv is 888 MB (per ls -lh).

The issue:

As I am just loading my CSV file into a pandas DataFrame object, why does my program need more than 1 GB of RAM when the object size is only about 0.5 GB?

Many thanks

asked Sep 04 '18 by DareYang




1 Answer

I'm not going to pretend to know the deep underlying details of how pandas manages memory for its data. What I can say is that pandas is hungry when it loads large flat files: as a rule of thumb, it can use 5-10x as much RAM as the size of the file you're loading in order to do analysis on it.

To reduce this you can read the data in chunks (note the function is pd.read_csv; pd.load_csv does not exist, and with chunksize it returns an iterator of DataFrames rather than a single DataFrame):

  reader = pd.read_csv(file_path, chunksize=30000)

or, if your analysis only needs certain columns:

  df = pd.read_csv(file_path, usecols=list_of_columns_index)

or both!

  reader = pd.read_csv(file_path, chunksize=30000, usecols=list_of_columns_index)
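
Because the chunked reader is an iterator, the chunks have to be consumed in a loop. A minimal sketch of that pattern, reusing the dtype mapping from the question (computing the mean of column 'A' is a hypothetical analysis step, adapt it to yours):

  # process each chunk and keep only small running totals,
  # so the whole file is never resident in memory at once
  total = 0.0
  count = 0
  for chunk in pd.read_csv(file_path, chunksize=30000, dtype=columns_dtype):
      total += chunk['A'].sum()
      count += len(chunk)
  overall_mean = total / count  # exact mean of 'A' over the whole file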

Hope this helps speed up your analysis.

answered Nov 06 '22 by Tiblit