Consider the following dataframe:
import pandas as pd
import numpy as np
import pyarrow.parquet as pq
import pyarrow as pa
idx = pd.date_range('2017-01-01 12:00:00.000', '2017-03-01 12:00:00.000', freq='T')
dataframe = pd.DataFrame({'numeric_col': np.random.rand(len(idx)),
                          'string_col': pd.util.testing.rands_array(8, len(idx))},
                         index=idx)
dataframe
Out[30]:
                     numeric_col string_col
2017-01-01 12:00:00       0.4069   wWw62tq6
2017-01-01 12:01:00       0.2050   SleB4f6K
2017-01-01 12:02:00       0.5180   cXBvEXdh
2017-01-01 12:03:00       0.3069   r9kYsJQC
2017-01-01 12:04:00       0.3571   F2JjUGgO
2017-01-01 12:05:00       0.3170   8FPC4Pgz
2017-01-01 12:06:00       0.9454   ybeNnZGV
2017-01-01 12:07:00       0.3353   zSLtYPWF
2017-01-01 12:08:00       0.8510   tDZJrdMM
2017-01-01 12:09:00       0.4948   S1Rm2Sqb
2017-01-01 12:10:00       0.0279   TKtmys86
2017-01-01 12:11:00       0.5709   ww0Pe1cf
2017-01-01 12:12:00       0.8274   b07wKPsR
2017-01-01 12:13:00       0.3848   9vKTq3M3
2017-01-01 12:14:00       0.6579   crYxFvlI
2017-01-01 12:15:00       0.6568   yGUnCW6n
I need to write this dataframe into many parquet files. Of course, the following works:
table = pa.Table.from_pandas(dataframe)
pq.write_table(table, '\\\\mypath\\dataframe.parquet', flavor='spark')
My issue is that the resulting (single) parquet file gets too big.

How can I efficiently (memory-wise, speed-wise) split the writing into daily parquet files (and keep the spark flavor)? These daily files will be easier to read in parallel with spark later on.

Thanks!
pyarrow is generally faster than fastparquet; little wonder it is the default engine used in dask.
An ORC or Parquet file contains data columns. To these files you can add partition columns at write time. The data files do not store values for the partition columns; instead, when writing, the rows are divided into groups (partitions) based on those column values.
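For instance, a dataset partitioned on a hypothetical dt column is usually laid out on disk roughly as sketched below (dataset_name and the part-file names are placeholders); the dt values appear only in the directory names, never inside the parquet files themselves:
dataset_name/
    dt=2017-01-01/
        part-0.parquet
    dt=2017-01-02/
        part-0.parquet
    ...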
Parquet uses efficient data compression and encoding schemes for fast data storage and retrieval. Parquet with "gzip" compression (for storage) is slightly faster to export than plain .csv (if the CSV needs to be zipped, then parquet is much faster), and importing is about 2x faster than CSV.
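As a rough sketch of the two export paths compared above (the file names are placeholders, and actual timings depend on the data):
# Parquet with gzip compression vs. a gzipped CSV (illustrative only)
dataframe.to_parquet('dataframe.parquet.gzip', engine='pyarrow', compression='gzip')
dataframe.to_csv('dataframe.csv.gz', compression='gzip')
pd.read_parquet('dataframe.parquet.gzip', engine='pyarrow')     # import; typically faster than read_csv
pd.read_csv('dataframe.csv.gz', index_col=0, parse_dates=True)  # gzipped CSV round trip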
fastparquet is a Python implementation of the parquet format, aiming to integrate into Python-based big-data workflows. Not all parts of the parquet format have been implemented or tested yet (see the project's todo list).
Making a string column dt based off of the index will allow you to write out the data partitioned by date by running
pq.write_to_dataset(table, root_path='dataset_name', partition_cols=['dt'], flavor='spark')
The answer is based off of this source (note that the source incorrectly lists the partition argument as partition_columns).
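For completeness, here is a minimal end-to-end sketch of the two steps, assuming the dataframe built in the question; the dt column name, the date format, and the dataset_name root path are illustrative choices rather than anything required by pyarrow:
import pyarrow as pa
import pyarrow.parquet as pq

# Derive a string date column from the DatetimeIndex (one value per calendar day).
dataframe['dt'] = dataframe.index.strftime('%Y-%m-%d')

table = pa.Table.from_pandas(dataframe)

# Writes one sub-directory per day (dt=2017-01-01, dt=2017-01-02, ...)
# under 'dataset_name', keeping the spark flavor.
pq.write_to_dataset(table, root_path='dataset_name',
                    partition_cols=['dt'], flavor='spark')
Spark can then point at the dataset_name directory, prune on dt, and read the daily files in parallel.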