Consider the following dataframe:
import pandas as pd
import numpy as np
import pyarrow.parquet as pq
import pyarrow as pa
idx = pd.date_range('2017-01-01 12:00:00.000', '2017-03-01 12:00:00.000', freq='T')
dataframe = pd.DataFrame({'numeric_col': np.random.rand(len(idx)),
                          'string_col': pd.util.testing.rands_array(8, len(idx))},
                         index=idx)
dataframe
Out[30]:
                     numeric_col string_col
2017-01-01 12:00:00       0.4069   wWw62tq6
2017-01-01 12:01:00       0.2050   SleB4f6K
2017-01-01 12:02:00       0.5180   cXBvEXdh
2017-01-01 12:03:00       0.3069   r9kYsJQC
2017-01-01 12:04:00       0.3571   F2JjUGgO
2017-01-01 12:05:00       0.3170   8FPC4Pgz
2017-01-01 12:06:00       0.9454   ybeNnZGV
2017-01-01 12:07:00       0.3353   zSLtYPWF
2017-01-01 12:08:00       0.8510   tDZJrdMM
2017-01-01 12:09:00       0.4948   S1Rm2Sqb
2017-01-01 12:10:00       0.0279   TKtmys86
2017-01-01 12:11:00       0.5709   ww0Pe1cf
2017-01-01 12:12:00       0.8274   b07wKPsR
2017-01-01 12:13:00       0.3848   9vKTq3M3
2017-01-01 12:14:00       0.6579   crYxFvlI
2017-01-01 12:15:00       0.6568   yGUnCW6n
I need to write this dataframe into many parquet files. Of course, the following works:
table = pa.Table.from_pandas(dataframe)
pq.write_table(table, '\\\\mypath\\dataframe.parquet', flavor='spark')
My issue is that the resulting (single) parquet file gets too big.

How can I efficiently (memory-wise, speed-wise) split the writing into daily parquet files (and keep the spark flavor)? These daily files will be easier to read in parallel with spark later on.

Thanks!
pyarrow is generally faster than fastparquet; little wonder it is the default engine used in dask.
An ORC or Parquet file contains data columns. To these files you can add partition columns at write time. The data files do not store values for the partition columns; instead, when writing, the rows are divided into groups (partitions) based on those column values.
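For instance, a dataset partitioned on a hypothetical dt column is usually laid out on disk roughly as sketched below (dataset_name and the part-file names are placeholders); the dt values appear only in the directory names, never inside the parquet files themselves:
dataset_name/
    dt=2017-01-01/
        part-0.parquet
    dt=2017-01-02/
        part-0.parquet
    ...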
Parquet uses efficient data compression and encoding schemes for fast data storage and retrieval. Parquet with "gzip" compression (for storage) is slightly faster to export than plain .csv (if the CSV needs to be zipped, then parquet is much faster), and importing is about 2x faster than CSV.
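As a rough sketch of the two export paths compared above (the file names are placeholders, and actual timings depend on the data):
# Parquet with gzip compression vs. a gzipped CSV (illustrative only)
dataframe.to_parquet('dataframe.parquet.gzip', engine='pyarrow', compression='gzip')
dataframe.to_csv('dataframe.csv.gz', compression='gzip')
pd.read_parquet('dataframe.parquet.gzip', engine='pyarrow')     # import; typically faster than read_csv
pd.read_csv('dataframe.csv.gz', index_col=0, parse_dates=True)  # gzipped CSV round trip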
fastparquet is a Python implementation of the parquet format, aiming to integrate into Python-based big-data workflows. Not all parts of the parquet format have been implemented or tested yet (see the project's todo list).
Making a string column dt based off of the index will allow you to write out the data partitioned by date by running
pq.write_to_dataset(table, root_path='dataset_name', partition_cols=['dt'], flavor='spark')
The answer is based off of this source (note that the source incorrectly lists the partition argument as partition_columns).
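For completeness, here is a minimal end-to-end sketch of the two steps, assuming the dataframe built in the question; the dt column name, the date format, and the dataset_name root path are illustrative choices rather than anything required by pyarrow:
import pyarrow as pa
import pyarrow.parquet as pq

# Derive a string date column from the DatetimeIndex (one value per calendar day).
dataframe['dt'] = dataframe.index.strftime('%Y-%m-%d')

table = pa.Table.from_pandas(dataframe)

# Writes one sub-directory per day (dt=2017-01-01, dt=2017-01-02, ...)
# under 'dataset_name', keeping the spark flavor.
pq.write_to_dataset(table, root_path='dataset_name',
                    partition_cols=['dt'], flavor='spark')
Spark can then point at the dataset_name directory, prune on dt, and read the daily files in parallel.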