 

pandas read_csv memory consumption

I am reading huge pandas (version 0.18.1, on purpose) DataFrames stored in CSV format (summing up to ~30 GB). With read_csv, however, memory consumption grows to roughly double the size of the initial CSV files, i.e. ~60 GB. I am aware of the chunksize parameter, but it was much slower and didn't really reduce memory usage. I tried it with a 4 GB DataFrame: after reading it, the script still consumed ~7 GB of RAM. Here's my code:

df = None

for chunk in pandas.read_csv(fn, chunksize=50000):
    if df is None:
        df = chunk
    else:
        # append each new chunk to the running DataFrame
        df = pandas.concat([df, chunk])

This is only a short version. I am also aware that specifying the dtype saves memory. So here's my question: what's the best way (in terms of performance and memory) to read huge pandas DataFrames?
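For example, here's a minimal sketch of what I mean by specifying dtypes up front (the column names, types, and usecols choice are made up for illustration and would have to match the actual CSV schema):

import pandas

# Hypothetical schema -- adjust to the real columns in the file.
dtypes = {
    "id": "int32",
    "value": "float32",
    "category": "category",  # low-cardinality strings shrink a lot as category
}

# Read only the needed columns with compact dtypes.
df = pandas.read_csv(fn, dtype=dtypes, usecols=list(dtypes))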

asked Oct 29 '22 by Hansi

1 Answer

Depending on the types of operations you want to do on the dataframes, you might find dask useful. One of its key features is allowing operations on larger-than-memory dataframes. For example, to do a groupby on a larger-than-memory dataframe:

import dask.dataframe as dd
df = dd.read_csv(fn)
df_means = df.groupby(key).mean().compute()

Note the addition of compute() at the end, as compared to a typical pandas groupby operation.
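For instance, here is a sketch of a slightly larger pipeline (the glob pattern and column names "key"/"value" are assumptions) where only the small aggregated result is ever materialized in memory:

import dask.dataframe as dd

# Lazily read all CSV parts; nothing is loaded into memory yet.
df = dd.read_csv("data/part-*.csv", dtype={"value": "float32"})

# Build the computation graph: filter, then group and aggregate.
result = df[df["value"] > 0].groupby("key")["value"].mean()

# Only the final (small) aggregate is computed and pulled into memory here.
print(result.compute())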

answered Nov 15 '22 by jdmcbr