Pandas memory usage inconsistencies

I have some memory inconsistencies when I am using a Pandas DataFrame.

Here is my code skeleton:

import pandas as pd
import numpy as np

columns_dtype = {'A': np.int16, 'B': np.int8, ...}
df = pd.read_csv('my_file.csv', dtype=columns_dtype)

This basically just reads a CSV file with pandas while controlling the column data types. But when I look at how much memory is allocated to my program, the numbers do not seem consistent.

Info 1:

df.info(memory_usage='deep')

That gives: memory usage: 482.6 MB
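
The same total can be broken down per column to see where the bytes go (a small sketch using the df defined above):

  # per-column memory in bytes; deep=True also counts the contents
  # of object (e.g. string) columns, not just the pointer arrays
  df.memory_usage(deep=True)

  # the sum matches the total reported by df.info(memory_usage='deep')
  df.memory_usage(deep=True).sum()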

Info 2:

import dill, sys
sys.getsizeof(dill.dumps(df))

That gives: 506049782 bytes (so about 506 MB)

Info 3:

The RAM allocated to my program is: 1.1 GiB (about 1.2 GB)
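
One way such a figure can be obtained from inside the process is sketched below; using psutil here is an assumption, the value could equally come from top or a task manager:

  import os
  import psutil

  # resident set size (RSS) of the current Python process, in GiB
  rss_bytes = psutil.Process(os.getpid()).memory_info().rss
  print(rss_bytes / 1024**3, "GiB")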

Additional info (but I do not think it is relevant):

The size of my_file.csv is 888 MB (per ls -lh).

The issue:

As I am just loading my CSV file into a pandas DataFrame object, why does my program need more than 1 GB of RAM when the object size is only about 0.5 GB?

Many thanks

asked Sep 04 '18 by DareYang




1 Answer

I'm not going to pretend to know the deep underlying details of how pandas manages memory for its data. What I can say is that pandas is hungry when it loads large flat files: as a rule of thumb, it can use 5-10x as much RAM as the size of the file you're loading in order to do analysis on it.

To reduce this you can read the data in chunks (note the function is pd.read_csv; pd.load_csv does not exist, and with chunksize it returns an iterator of DataFrames rather than a single DataFrame):

  reader = pd.read_csv(file_path, chunksize=30000)

or, if your analysis only needs certain columns:

  df = pd.read_csv(file_path, usecols=list_of_columns_index)

or both!

  reader = pd.read_csv(file_path, chunksize=30000, usecols=list_of_columns_index)
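
Because the chunked reader is an iterator, the chunks have to be consumed in a loop. A minimal sketch of that pattern, reusing the dtype mapping from the question (computing the mean of column 'A' is a hypothetical analysis step, adapt it to yours):

  # process each chunk and keep only small running totals,
  # so the whole file is never resident in memory at once
  total = 0.0
  count = 0
  for chunk in pd.read_csv(file_path, chunksize=30000, dtype=columns_dtype):
      total += chunk['A'].sum()
      count += len(chunk)
  overall_mean = total / count  # exact mean of 'A' over the whole file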

Hope this helps speed up your analysis.

answered Nov 06 '22 by Tiblit