
Load multiple parquet files into dataframe for analysis


I have several .parquet files, each with shape (1126399, 503) and a size of 13MB on disk. From what I have read, this should be manageable on a local machine. I am trying to get them into a pandas dataframe to run some analysis, but I'm having trouble doing so. Saving them to a CSV file is too costly as the files become extremely large, and loading them directly into several dataframes and then concatenating gives me memory errors. I've never worked with .parquet files before and am not sure what the best path forward is, or how to use the files to actually do some analysis with the data.

At first, I tried:

import pandas as pd
import pyarrow.parquet as pq

# This is repeated for all files
p0 = pq.read_table('part0.parquet') # each part increases python's memory usage by ~14%
df0 = p0.to_pandas() # each frame increases python's memory usage by an additional ~14%

# Concatenate all dataframes together
df = pd.concat([df0, df1, df2, df3, df4, df5], ignore_index=True)

This was causing me to run out of memory. I am running on a system with 12 cores and 32GB of memory. I thought I'd be more efficient and tried looping through and deleting the files that were no longer needed:

import pandas as pd

# Loop through files and load into a dataframe
df = pd.read_parquet('part0.parquet', engine='pyarrow')
files = ['part1.parquet', 'part2.parquet', 'part3.parquet'] # in total there are 6 files

for file in files:
    data = pd.read_parquet(file)
    df = df.append(data, ignore_index=True)
    del data

Unfortunately, neither of these worked. Any and all help is greatly appreciated.

asked Oct 02 '18 by schaefferda


1 Answer

I opened https://issues.apache.org/jira/browse/ARROW-3424 about at least making a function in pyarrow that will load a collection of file paths as efficiently as possible. In the meantime, you can load the files individually with pyarrow.parquet.read_table, concatenate the resulting pyarrow.Table objects with pyarrow.concat_tables, and then call Table.to_pandas to convert to a pandas.DataFrame. That will be much more efficient than concatenating with pandas.

answered Nov 11 '22 by Wes McKinney