 

Write large pandas dataframe as parquet with pyarrow

I'm trying to write a large pandas DataFrame (shape 4247x10) to a parquet file.

Nothing special, just using the following code:

df_base = read_from_google_storage()
df_base.to_parquet(courses.CORE_PATH,
                   engine='pyarrow',
                   compression='gzip',
                   partition_cols=None)

I tried different compressions and different partition_cols, but it fails anyway.

I should mention that it works fine with smaller dataframes (under roughly 1000x10), and it also works when I'm debugging and give it enough time, but otherwise I get this error:

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

Libs I'm using:

pandas==0.25.3
pyarrow==0.15.1
1 Answer

The issue might be related to https://issues.apache.org/jira/browse/PARQUET-1345, but I'm not sure.

Here is the workaround I found:

import pandas as pd
from pyarrow import Table
from pyarrow import parquet as pq


df_base = pd.read_csv('big_df.csv')

# Converting with a single thread avoids the segfault
table = Table.from_pandas(df_base, nthreads=1)
print(table.columns)
print(table.num_rows)

# Write the Arrow table directly instead of using DataFrame.to_parquet()
pq.write_table(table, courses.CORE_PATH, compression='GZIP')

I'm not sure exactly why it fails, but setting nthreads=1 helps to avoid the SIGSEGV (segmentation fault).
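
If you want to keep the DataFrame.to_parquet() call instead of building the Arrow table manually, a variation of the same idea is to limit pyarrow's global CPU thread pool before writing. This is an untested sketch based on the assumption that the crash comes from the multi-threaded pandas-to-Arrow conversion; 'big_df.csv' and 'out.parquet' are placeholder paths:

import pandas as pd
import pyarrow as pa

# Assumption: restricting pyarrow's CPU thread pool to a single thread
# should behave like Table.from_pandas(..., nthreads=1).
pa.set_cpu_count(1)

df_base = pd.read_csv('big_df.csv')   # placeholder input path
df_base.to_parquet('out.parquet',     # placeholder output path
                   engine='pyarrow',
                   compression='gzip')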
