I am reading huge pandas DataFrames (pandas version 0.18.1, on purpose) stored in CSV format (about 30 GB in total). With read_csv, however, memory consumption grows to roughly double the size of the original CSV files, i.e. about 60 GB. I am aware of the chunksize parameter. That approach, however, was much slower and didn't really reduce memory usage. I tried it with a 4 GB DataFrame, and after reading it the script still consumed about 7 GB of RAM. Here's my code:
import pandas

# Read the CSV in chunks and concatenate them into one DataFrame.
df = None
for chunk in pandas.read_csv(fn, chunksize=50000):
    if df is None:
        df = chunk
    else:
        df = pandas.concat([df, chunk])
This is only a short version. I am also aware that specifying the dtype saves memory. So here's my question: what's the best way (in terms of performance and memory) to read huge pandas DataFrames?
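For reference, this is roughly what I mean by specifying the dtype; the column names and types here are just placeholders for my actual data:

# Hypothetical columns: declare narrower dtypes up front so pandas does not
# default every column to 64-bit types.
dtypes = {"id": "int32", "value": "float32"}
df = pandas.read_csv(fn, dtype=dtypes)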
Depending on the types of operations you want to do on the dataframes, you might find dask useful. One of its key features is allowing operations on larger-than-memory dataframes. For example, to do a groupby on a larger-than-memory dataframe:
import dask.dataframe as dd
df = dd.read_csv(fn)
df_means = df.groupby(key).mean().compute()
Note the addition of compute() at the end, as compared to a typical pandas groupby operation.
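If your data is split across several files, something along these lines should also work (the file pattern, column names, and dtypes below are made up for illustration): dd.read_csv accepts glob patterns, and compute() hands back an ordinary pandas object once the aggregated result is small enough to fit in memory.

import dask.dataframe as dd

# Hypothetical file pattern and columns: read many CSVs lazily as one
# dask DataFrame, with explicit dtypes to keep memory usage down.
df = dd.read_csv("data/part-*.csv", dtype={"id": "int32", "value": "float32"})

# The aggregation runs out-of-core; compute() returns a pandas Series
# because the grouped result is much smaller than the raw data.
means = df.groupby("id")["value"].mean().compute()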