I have several .parquet files, each with shape (1126399, 503) and a size of 13 MB. From what I have read, this should be manageable on a local machine. I am trying to get them into a pandas DataFrame to run some analysis, but I am having trouble doing so. Saving them to CSV is too costly because the files become extremely large, and loading them directly into several DataFrames and then concatenating gives me memory errors. I've never worked with .parquet files before and I am not sure what the best path forward is, or how to use the files to actually do some analysis with the data.
At first, I tried:
import pandas as pd
import pyarrow.parquet as pq
# This is repeated for all files
p0 = pq.read_table('part0.parquet')  # each table increases python's memory usage by ~14%
df0 = p0.to_pandas()  # each frame increases python's memory usage by an additional ~14%
# Concatenate all dataframes together
df = pd.concat([df0, df1, df2, df3, df4, df5], ignore_index=True)
This was causing me to run out of memory. I am running on a system with 12 cores and 32 GB of memory. I thought I'd be more efficient and tried looping through the files, deleting each intermediate DataFrame once it was no longer needed:
import pandas as pd
# Loop through files and load into a dataframe
df = pd.read_parquet('part0.parquet', engine='pyarrow')
files = ['part1.parquet', 'part2.parquet', 'part3.parquet']  # in total there are 6 files
for file in files:
    data = pd.read_parquet(file, engine='pyarrow')
    df = df.append(data, ignore_index=True)
    del data
Unfortunately, neither of these worked. Any and all help is greatly appreciated.
I opened https://issues.apache.org/jira/browse/ARROW-3424 about at least adding a function to pyarrow that loads a collection of file paths as efficiently as possible. In the meantime, you can load the files individually with pyarrow.parquet.read_table, concatenate the resulting pyarrow.Table objects with pyarrow.concat_tables, and then call Table.to_pandas once to convert to a pandas.DataFrame. That will be much more efficient than concatenating with pandas.
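A minimal sketch of that approach, assuming the six part files are named part0.parquet through part5.parquet as in the question (adjust the paths to your own files):

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical file names based on the question
files = [f'part{i}.parquet' for i in range(6)]

# Read each file into an Arrow Table (columnar, no pandas copy yet)
tables = [pq.read_table(f) for f in files]

# Concatenate at the Arrow level, then convert to pandas a single time
table = pa.concat_tables(tables)
df = table.to_pandas()

Even if you stay in pandas, prefer collecting the individual DataFrames in a list and calling pd.concat once over appending in a loop: DataFrame.append copies the entire frame on every call (and has since been removed in pandas 2.0).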