 

Converting HDF5 to Parquet without loading into memory

I have a large dataset (~600 GB) stored in HDF5 format. As this is too large to fit in memory, I would like to convert it to Parquet format and use pySpark to perform some basic data preprocessing (normalization, finding correlation matrices, etc.). However, I am unsure how to convert the entire dataset to Parquet without loading it into memory.

I looked at this gist: https://gist.github.com/jiffyclub/905bf5e8bf17ec59ab8f#file-hdf_to_parquet-py, but it appears that the entire dataset is being read into memory.

One thing I thought of was reading the HDF5 file in chunks and saving them incrementally into a Parquet file:

import pandas as pd

test_store = pd.HDFStore('/path/to/myHDFfile.h5')
nrows = test_store.get_storer('df').nrows
chunksize = N

for i in range(nrows // chunksize + 1):
    chunk = test_store.select('df', start=i * chunksize, stop=(i + 1) * chunksize)
    # convert_to_Parquet(chunk) ...

But I can't find any documentation that would allow me to incrementally build up a Parquet file. Any links to further reading would be appreciated.

asked Sep 11 '17 by Eweler

People also ask

Why is my HDF5 file so large?

This is probably due to your chunk layout: the smaller the chunks, the more bloated your HDF5 file will be. Try to find a balance between chunk sizes that suit your use case and the size overhead they introduce in the HDF5 file.
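For illustration only (not from the original answer), this is roughly how chunk sizes are chosen when writing with h5py; the file name, dataset name, and shapes below are made up:

import h5py
import numpy as np

with h5py.File('example.h5', 'w') as f:
    # Larger chunks mean less per-chunk overhead stored in the file,
    # at the cost of reading more data than needed for small slices.
    f.create_dataset('data',
                     data=np.zeros((1_000_000, 10)),
                     chunks=(10_000, 10),
                     compression='gzip')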

How is data stored in Parquet format?

Each block in a Parquet file is stored in the form of row groups, so the data in a Parquet file is partitioned into multiple row groups. These row groups in turn consist of one or more column chunks, each corresponding to a column of the dataset. The data for each column chunk is then written in the form of pages.
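As a rough illustration (not part of the original answer), this layout can be inspected with pyarrow's metadata API; the file name below is a placeholder:

import pyarrow.parquet as pq

pf = pq.ParquetFile('my.parquet')  # placeholder path
meta = pf.metadata

print('row groups:', meta.num_row_groups)
for rg in range(meta.num_row_groups):
    group = meta.row_group(rg)
    # each row group holds one column chunk per column of the dataset
    print('row group', rg, '->', group.num_columns, 'column chunks,',
          group.num_rows, 'rows')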


2 Answers

You can use pyarrow for this!

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def convert_hdf5_to_parquet(h5_file, parquet_file, chunksize=100000):

    stream = pd.read_hdf(h5_file, chunksize=chunksize)

    for i, chunk in enumerate(stream):
        print("Chunk {}".format(i))

        if i == 0:
            # Infer schema and open parquet file on first chunk
            parquet_schema = pa.Table.from_pandas(df=chunk).schema
            parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')

        table = pa.Table.from_pandas(chunk, schema=parquet_schema)
        parquet_writer.write_table(table)

    parquet_writer.close()
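A quick usage sketch (the paths here are placeholders, not from the answer). Note that reading with chunksize generally requires the HDF5 store to have been written in table format (format='table'); a fixed-format store cannot be read in chunks.

convert_hdf5_to_parquet('/path/to/myHDFfile.h5', '/path/to/output.parquet', chunksize=500000)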
answered Oct 10 '22 by ostrokach


Thanks for your answer. I tried running the script below from the command line, but it neither shows any error nor produces a converted Parquet file.

The h5 files are not empty either.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

h5_file = "C:\Users...\tall.h5"
parquet_file = "C:\Users...\my.parquet"

def convert_hdf5_to_parquet(h5_file, parquet_file, chunksize=100000):

    stream = pd.read_hdf(h5_file, chunksize=chunksize)

    for i, chunk in enumerate(stream):
        print("Chunk {}".format(i))
        print(chunk.head())

        if i == 0:
            # Infer schema and open parquet file on first chunk
            parquet_schema = pa.Table.from_pandas(df=chunk).schema
            parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')

        table = pa.Table.from_pandas(chunk, schema=parquet_schema)
        parquet_writer.write_table(table)

    parquet_writer.close()
answered Oct 10 '22 by R S