I have a large dataset (~600 GB) stored in HDF5 format. As it is too large to fit in memory, I would like to convert it to Parquet format and use pySpark to perform some basic preprocessing (normalization, computing correlation matrices, etc.). However, I am unsure how to convert the entire dataset to Parquet without loading it into memory.
I looked at this gist: https://gist.github.com/jiffyclub/905bf5e8bf17ec59ab8f#file-hdf_to_parquet-py, but it appears that the entire dataset is being read into memory.
One thing I thought of was to read the HDF5 file in chunks and save them incrementally into a Parquet file:
test_store = pd.HDFStore('/path/to/myHDFfile.h5')
nrows = test_store.get_storer('df').nrows
chunksize = N
for i in range(nrows // chunksize + 1):
    # convert_to_Parquet() ...
But I can't find any documentation that would allow me to incrementally build up a Parquet file. Any links to further reading would be appreciated.
This is probably due to your chunk layout - the smaller your chunks, the more your HDF5 file will be bloated. Try to find an optimal balance between the chunk size (so it serves your use case well) and the size overhead that chunking introduces in the HDF5 file.
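As a rough illustration (a sketch assuming h5py, a made-up dataset name "df", and made-up sizes, not part of the original post), the chunk shape is chosen explicitly when the dataset is written; larger chunks mean less per-chunk overhead at the cost of coarser-grained I/O:

import h5py
import numpy as np

# Hypothetical example data; in practice the array would be written incrementally.
data = np.random.rand(1_000_000, 10)

with h5py.File("example.h5", "w") as f:
    # Each chunk covers 100,000 rows by all 10 columns; tune this to your access pattern.
    f.create_dataset("df", data=data, chunks=(100_000, 10), compression="gzip")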
Each block in a Parquet file is stored in the form of row groups, so the data in a Parquet file is partitioned into multiple row groups. These row groups in turn consist of one or more column chunks, each of which corresponds to a column in the dataset. The data for each column chunk is then written in the form of pages.
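If it helps to see that structure concretely, here is a small sketch (assuming pyarrow and a hypothetical file name "my.parquet") that prints the row-group and column-chunk metadata:

import pyarrow.parquet as pq

pf = pq.ParquetFile("my.parquet")
print("Row groups:", pf.metadata.num_row_groups)

rg = pf.metadata.row_group(0)   # metadata for the first row group
print("Rows in group 0:", rg.num_rows)

col = rg.column(0)              # first column chunk in that row group
print("Column path:", col.path_in_schema)
print("Compressed size:", col.total_compressed_size)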
You can use pyarrow for this!
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def convert_hdf5_to_parquet(h5_file, parquet_file, chunksize=100000):

    stream = pd.read_hdf(h5_file, chunksize=chunksize)

    for i, chunk in enumerate(stream):
        print("Chunk {}".format(i))

        if i == 0:
            # Infer schema and open parquet file on first chunk
            parquet_schema = pa.Table.from_pandas(df=chunk).schema
            parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')

        # Append each chunk to the same Parquet file as a new row group
        table = pa.Table.from_pandas(chunk, schema=parquet_schema)
        parquet_writer.write_table(table)

    parquet_writer.close()
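As a follow-up sketch (not part of the original answer), the converted file could then be fed to PySpark for the preprocessing mentioned in the question; the paths and the column names "feature1"/"feature2" are placeholders:

convert_hdf5_to_parquet('/path/to/myHDFfile.h5', '/path/to/out.parquet')

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

spark = SparkSession.builder.appName("hdf5-to-parquet").getOrCreate()
df = spark.read.parquet('/path/to/out.parquet')

# Assemble the numeric columns into a vector and compute a Pearson correlation matrix.
vec = VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features").transform(df)
corr_matrix = Correlation.corr(vec, "features").head()[0]
print(corr_matrix)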
Thanks for your answer. I tried running the script below from the CLI, but it neither shows any error nor produces a converted Parquet file. The .h5 files are not empty either.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

h5_file = "C:\Users...\tall.h5"
parquet_file = "C:\Users...\my.parquet"

def convert_hdf5_to_parquet(h5_file, parquet_file, chunksize=100000):

    stream = pd.read_hdf(h5_file, chunksize=chunksize)

    for i, chunk in enumerate(stream):
        print("Chunk {}".format(i))
        print(chunk.head())

        if i == 0:
            # Infer schema and open parquet file on first chunk
            parquet_schema = pa.Table.from_pandas(df=chunk).schema
            parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')

        table = pa.Table.from_pandas(chunk, schema=parquet_schema)
        parquet_writer.write_table(table)

    parquet_writer.close()
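One thing worth checking: the script above only defines convert_hdf5_to_parquet and never calls it, which would explain a silent run with no output file. A minimal sketch of the missing call (the truncated Windows paths are left as placeholders, written as raw strings so the backslashes are not treated as escape sequences):

if __name__ == "__main__":
    # Hypothetical invocation; replace the placeholder paths with the real files.
    convert_hdf5_to_parquet(r"C:\Users...\tall.h5", r"C:\Users...\my.parquet")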