I have a large dataset (~600 GB) stored in HDF5 format. As it is too large to fit in memory, I would like to convert it to Parquet format and use pySpark to perform some basic preprocessing (normalization, computing correlation matrices, etc.). However, I am unsure how to convert the entire dataset to Parquet without loading it into memory.
I looked at this gist: https://gist.github.com/jiffyclub/905bf5e8bf17ec59ab8f#file-hdf_to_parquet-py, but it appears that the entire dataset is being read into memory.
One thing I thought of was to read the HDF5 file in chunks and save them incrementally into a Parquet file:
test_store = pd.HDFStore('/path/to/myHDFfile.h5')
nrows = test_store.get_storer('df').nrows
chunksize = N
for i in range(nrows // chunksize + 1):
    # convert_to_Parquet() ...
But I can't find any documentation that would allow me to incrementally build up a Parquet file. Any links to further reading would be appreciated.
This is probably due to your chunk layout - the smaller your chunks, the more your HDF5 file will be bloated. Try to find an optimal balance between the chunk size (so it serves your use case well) and the size overhead that chunking introduces in the HDF5 file.
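As a rough illustration (a sketch assuming h5py, a made-up dataset name "df", and made-up sizes, not part of the original post), the chunk shape is chosen explicitly when the dataset is written; larger chunks mean less per-chunk overhead at the cost of coarser-grained I/O:

import h5py
import numpy as np

# Hypothetical example data; in practice the array would be written incrementally.
data = np.random.rand(1_000_000, 10)

with h5py.File("example.h5", "w") as f:
    # Each chunk covers 100,000 rows by all 10 columns; tune this to your access pattern.
    f.create_dataset("df", data=data, chunks=(100_000, 10), compression="gzip")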
Each block in a Parquet file is stored in the form of row groups, so the data in a Parquet file is partitioned into multiple row groups. These row groups in turn consist of one or more column chunks, each of which corresponds to a column in the dataset. The data for each column chunk is then written in the form of pages.
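If it helps to see that structure concretely, here is a small sketch (assuming pyarrow and a hypothetical file name "my.parquet") that prints the row-group and column-chunk metadata:

import pyarrow.parquet as pq

pf = pq.ParquetFile("my.parquet")
print("Row groups:", pf.metadata.num_row_groups)

rg = pf.metadata.row_group(0)   # metadata for the first row group
print("Rows in group 0:", rg.num_rows)

col = rg.column(0)              # first column chunk in that row group
print("Column path:", col.path_in_schema)
print("Compressed size:", col.total_compressed_size)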
You can use pyarrow for this!
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def convert_hdf5_to_parquet(h5_file, parquet_file, chunksize=100000):

    stream = pd.read_hdf(h5_file, chunksize=chunksize)

    for i, chunk in enumerate(stream):
        print("Chunk {}".format(i))

        if i == 0:
            # Infer schema and open parquet file on first chunk
            parquet_schema = pa.Table.from_pandas(df=chunk).schema
            parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')

        # Append each chunk to the same Parquet file as a new row group
        table = pa.Table.from_pandas(chunk, schema=parquet_schema)
        parquet_writer.write_table(table)

    parquet_writer.close()
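As a follow-up sketch (not part of the original answer), the converted file could then be fed to PySpark for the preprocessing mentioned in the question; the paths and the column names "feature1"/"feature2" are placeholders:

convert_hdf5_to_parquet('/path/to/myHDFfile.h5', '/path/to/out.parquet')

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

spark = SparkSession.builder.appName("hdf5-to-parquet").getOrCreate()
df = spark.read.parquet('/path/to/out.parquet')

# Assemble the numeric columns into a vector and compute a Pearson correlation matrix.
vec = VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features").transform(df)
corr_matrix = Correlation.corr(vec, "features").head()[0]
print(corr_matrix)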
Thanks for your answer. I tried running the script below from the CLI, but it neither shows any error nor produces a converted Parquet file. The .h5 files are not empty either.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

h5_file = "C:\Users...\tall.h5"
parquet_file = "C:\Users...\my.parquet"

def convert_hdf5_to_parquet(h5_file, parquet_file, chunksize=100000):

    stream = pd.read_hdf(h5_file, chunksize=chunksize)

    for i, chunk in enumerate(stream):
        print("Chunk {}".format(i))
        print(chunk.head())

        if i == 0:
            # Infer schema and open parquet file on first chunk
            parquet_schema = pa.Table.from_pandas(df=chunk).schema
            parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')

        table = pa.Table.from_pandas(chunk, schema=parquet_schema)
        parquet_writer.write_table(table)

    parquet_writer.close()
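One thing worth checking: the script above only defines convert_hdf5_to_parquet and never calls it, which would explain a silent run with no output file. A minimal sketch of the missing call (the truncated Windows paths are left as placeholders, written as raw strings so the backslashes are not treated as escape sequences):

if __name__ == "__main__":
    # Hypothetical invocation; replace the placeholder paths with the real files.
    convert_hdf5_to_parquet(r"C:\Users...\tall.h5", r"C:\Users...\my.parquet")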