I am reading huge pandas DataFrames (pandas version 0.18.1, on purpose) stored in CSV format (about 30 GB in total). With read_csv, however, memory consumption grows to roughly double the size of the original CSV files, i.e. about 60 GB. I am aware of the chunksize parameter. That approach, however, was much slower and didn't really reduce memory usage. I tried it with a 4 GB DataFrame, and after reading it the script still consumed about 7 GB of RAM. Here's my code:
import pandas

# Read the CSV in chunks and concatenate them into one DataFrame.
df = None
for chunk in pandas.read_csv(fn, chunksize=50000):
    if df is None:
        df = chunk
    else:
        df = pandas.concat([df, chunk])
This is only a short version. I am also aware that specifying the dtype saves memory. So here's my question: what's the best way (in terms of performance and memory) to read huge pandas DataFrames?
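For reference, this is roughly what I mean by specifying the dtype; the column names and types here are just placeholders for my actual data:

# Hypothetical columns: declare narrower dtypes up front so pandas does not
# default every column to 64-bit types.
dtypes = {"id": "int32", "value": "float32"}
df = pandas.read_csv(fn, dtype=dtypes)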
Depending on the types of operations you want to do on the dataframes, you might find dask useful. One of its key features is allowing operations on larger-than-memory dataframes. For example, to do a groupby on a larger-than-memory dataframe:
import dask.dataframe as dd
df = dd.read_csv(fn)
df_means = df.groupby(key).mean().compute()
Note the addition of compute() at the end, as compared to a typical pandas groupby operation.
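If your data is split across several files, something along these lines should also work (the file pattern, column names, and dtypes below are made up for illustration): dd.read_csv accepts glob patterns, and compute() hands back an ordinary pandas object once the aggregated result is small enough to fit in memory.

import dask.dataframe as dd

# Hypothetical file pattern and columns: read many CSVs lazily as one
# dask DataFrame, with explicit dtypes to keep memory usage down.
df = dd.read_csv("data/part-*.csv", dtype={"id": "int32", "value": "float32"})

# The aggregation runs out-of-core; compute() returns a pandas Series
# because the grouped result is much smaller than the raw data.
means = df.groupby("id")["value"].mean().compute()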