 

Convert large hdf5 dataset written via pandas/pytables to vaex

I have a very large dataset I write to hdf5 in chunks via append like so:

import os
import pickle

import pandas as pd
from tqdm import tqdm

with pd.HDFStore(self.train_store_path) as train_store:
    for filepath in tqdm(filepaths):
        with open(filepath, 'rb') as file:
            frame = pickle.load(file)

        # skip (and clean up) empty frames
        if frame.empty:
            os.remove(filepath)
            continue

        try:
            train_store.append(
                key='dataset', value=frame,
                min_itemsize=itemsize_dict)
            os.remove(filepath)  # only delete the source once the append succeeded
        except KeyError as e:
            print(e)
        except ValueError as e:
            print(frame)
            print(e)
        except Exception as e:
            print(e)

The data is far too large to load into one DataFrame, so I would like to try out vaex for further processing. There are a few things I don't really understand, though.

Since vaex uses a different representation in hdf5 than pandas/pytables, I'm wondering how to go about converting between the two formats. I tried loading the data in chunks into pandas, converting it to a vaex DataFrame and then storing it, but there seems to be no way to append data to an existing vaex hdf5 file, at least none that I could find.

Is there really no way to create a large hdf5 dataset from within vaex? Is the only option to convert an existing dataset to vaex's representation (constructing the file via a Python script or TOPCAT)?

Related to my previous question: if I work with a large dataset in vaex out-of-core, is it possible to persist the results of any transformations I apply in vaex into the hdf5 file?

asked Mar 04 '23 by sobek

1 Answer

The problem with this storage format is that it is not column-based, which does not play well with datasets that have a large number of rows: if you work with only one column, for instance, the OS will probably also read large portions of the other columns from disk, and the CPU cache gets polluted with that data. It would be better to store the data in a column-based format such as vaex's hdf5 format, or Arrow.

Converting to a vaex dataframe can be done using:

import vaex
vaex_df = vaex.from_pandas(pandas_df, copy_index=False)

You can do this for each dataframe, and store them on disk as hdf5 or arrow:

vaex_df.export('batch_1.hdf5')  # or 'batch_1.arrow'
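
Putting the two steps together, a minimal sketch of the chunked conversion (assuming the table key 'dataset' from the question; the store path, chunk size and batch file names are placeholders):

import pandas as pd
import vaex

# iterate over the existing pandas/pytables store in manageable chunks
with pd.HDFStore('train_store.h5') as store:
    for i, chunk in enumerate(store.select('dataset', chunksize=500_000)):
        batch = vaex.from_pandas(chunk, copy_index=False)
        batch.export(f'batch_{i}.hdf5')  # one column-based file per chunk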

If you do this for many files, you can concatenate them lazily (no memory copies are made), or use the vaex.open function:

df1 = vaex.open('batch_1.hdf5')
df2 = vaex.open('batch_2.hdf5')
df = vaex.concat([df1, df2])  # seen as one dataframe, without a memory copy
df_alternative = vaex.open('batch*.hdf5')  # same effect, but only needs one line
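
If you would rather end up with a single column-based file than with many batches, you can also export the lazily concatenated dataframe (a sketch; 'train_vaex.hdf5' is a placeholder name):

df = vaex.open('batch*.hdf5')
df.export('train_vaex.hdf5')  # writes all batches into one vaex hdf5 file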

Regarding your question about the transformations:

If you apply transformations to a dataframe, you can either write out the computed values or get the 'state', which includes the transformations:

import vaex
df = vaex.example()
df['difference'] = df.x - df.y
# df.export('materialized.hdf5', column_names=['difference'])  # do this if IO is fast, and memory abundant
# state = df.state_get()  # get state in memory
df.state_write('mystate.json')  # or write the state to disk as json


import vaex
df = vaex.example()
# df.join(vaex.open('materialized.hdf5'))  # join on row number (super fast, 0 memory use!)
# df.state_set(state)  # or apply the state from memory
df.state_load('mystate.json')  # or from disk
df
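
After df.state_load, 'difference' is available again as a virtual column, computed lazily from x and y, so the transformation is persisted without having to write the computed values into the hdf5 file (unless you materialize them as in the export variant above).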
answered Mar 05 '23 by Maarten Breddels