How to efficiently split a large dataframe into many parquet files?

Consider the following dataframe

import pandas as pd
import numpy as np
import pyarrow.parquet as pq
import pyarrow as pa

idx = pd.date_range('2017-01-01 12:00:00.000', '2017-03-01 12:00:00.000', freq = 'T')

dataframe = pd.DataFrame({'numeric_col' : np.random.rand(len(idx)),
                          'string_col' : pd.util.testing.rands_array(8,len(idx))},
                           index = idx)

dataframe
Out[30]: 
                     numeric_col string_col
2017-01-01 12:00:00       0.4069   wWw62tq6
2017-01-01 12:01:00       0.2050   SleB4f6K
2017-01-01 12:02:00       0.5180   cXBvEXdh
2017-01-01 12:03:00       0.3069   r9kYsJQC
2017-01-01 12:04:00       0.3571   F2JjUGgO
2017-01-01 12:05:00       0.3170   8FPC4Pgz
2017-01-01 12:06:00       0.9454   ybeNnZGV
2017-01-01 12:07:00       0.3353   zSLtYPWF
2017-01-01 12:08:00       0.8510   tDZJrdMM
2017-01-01 12:09:00       0.4948   S1Rm2Sqb
2017-01-01 12:10:00       0.0279   TKtmys86
2017-01-01 12:11:00       0.5709   ww0Pe1cf
2017-01-01 12:12:00       0.8274   b07wKPsR
2017-01-01 12:13:00       0.3848   9vKTq3M3
2017-01-01 12:14:00       0.6579   crYxFvlI
2017-01-01 12:15:00       0.6568   yGUnCW6n

I need to write this dataframe into many parquet files. Of course, the following works:

table = pa.Table.from_pandas(dataframe)
pq.write_table(table, '\\\\mypath\\dataframe.parquet', flavor='spark')

My issue is that the resulting (single) parquet file gets too big.

How can I efficiently (memory-wise and speed-wise) split the writing into daily parquet files (while keeping the spark flavor)? These daily files will be easier to read in parallel with Spark later on.

Thanks!

asked Jun 12 '18 by ℕʘʘḆḽḘ



1 Answer

Making a string column dt based on the index will then allow you to write out the data partitioned by date by running

pq.write_to_dataset(table, root_path='dataset_name', partition_cols=['dt'], flavor='spark')

The answer is based on this source (note that the source incorrectly lists the partitioning argument as partition_columns rather than partition_cols).
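
For completeness, here is a minimal sketch of the full flow under the question's setup; deriving dt with strftime is just one way to build a per-day string, and dataset_name is simply the root directory name used above:

import pyarrow as pa
import pyarrow.parquet as pq

# Add a string partition column derived from the DatetimeIndex, e.g. '2017-01-01'.
dataframe['dt'] = dataframe.index.strftime('%Y-%m-%d')

table = pa.Table.from_pandas(dataframe)

# Writes one subdirectory per day: dataset_name/dt=2017-01-01/..., dt=2017-01-02/..., etc.
# flavor='spark' is kept from the snippet above; extra keyword arguments are passed
# on to the underlying Parquet writer.
pq.write_to_dataset(table, root_path='dataset_name',
                    partition_cols=['dt'], flavor='spark')

Spark can later read the whole directory (e.g. spark.read.parquet('dataset_name')) and treats dt as a partition column, which is what makes the daily files easy to read in parallel.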

answered Nov 16 '22 by David