
How can I combine multiple .h5 files?

Everything that is available online is too complicated. My database is large, so I exported it in parts. I now have three .h5 files and I would like to combine them into one .h5 file for further work. How can I do it?

ktt_11 asked Sep 14 '25 17:09


1 Answer

For those who prefer PyTables, I redid my h5py examples to show different ways to copy data between two HDF5 files. These examples use the same example HDF5 files as before, and each file has only one dataset. When you have multiple datasets, you can extend this process with walk_nodes() in PyTables.
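
As a rough sketch of the walk_nodes() idea, the helper below lists every dataset (Leaf) in one file, whatever group it sits in. The function name list_datasets is hypothetical and used only for illustration.

```python
import tables as tb

def list_datasets(h5name):
    """Return the full path of every Leaf (Array, EArray, Table, ...) in the file."""
    with tb.open_file(h5name, mode='r') as h5fr:
        # classname='Leaf' skips Group nodes and yields datasets only
        return [node._v_pathname for node in h5fr.walk_nodes('/', classname='Leaf')]
```

Each returned path can then be opened with get_node() and copied or appended with any of the methods below.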

All methods use glob() to find the HDF5 files used in the operations below.
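
glob() returns matches in arbitrary, platform-dependent order, so the merge methods below may stack rows in a different order from one machine to the next. If row order matters, one option is to sort the file list first. The helper name find_h5_files is hypothetical.

```python
import glob

def find_h5_files(pattern='file*.h5'):
    """Return matching HDF5 file names in a reproducible (lexicographic) order."""
    return sorted(glob.glob(pattern))
```

Note that lexicographic order puts file10.h5 before file2.h5, so zero-padded names (file02.h5) are safer when there are ten or more parts.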

Method 1: Create External Links
As in the h5py version, this creates one external link per source file in the new HDF5 file (three here), each pointing to the root group of the original file. The data is NOT copied.

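import glob
import numpy as np
import tables as tb

with tb.open_file('table_links_2.h5', mode='w') as h5fw:
    # One external link per source file, each pointing at that file's root group
    for link_cnt, h5name in enumerate(glob.glob('file*.h5'), start=1):
        h5fw.create_external_link('/', 'link' + str(link_cnt), h5name + ':/')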
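
To read data back through one of these links, the link node has to be dereferenced first. The sketch below assumes the link points at the root group of the source file, as in Method 1. The function name read_through_link and its arguments are hypothetical.

```python
import tables as tb

def read_through_link(link_file, link_name, dset_name):
    """Read one dataset from a source file by following an external link."""
    with tb.open_file(link_file, mode='r') as h5f:
        link = h5f.get_node('/', link_name)   # the ExternalLink node itself
        src_root = link()                      # calling the link opens the source file
        try:
            # Copy the dataset into memory before the source file is closed
            return src_root._f_get_child(dset_name)[:]
        finally:
            link.umount()                      # close the source file again
```

Because the data still lives in the original files, they must stay where the links expect them. Moving or deleting a source file breaks its link.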

Method 2: Copy Data 'as-is'
This copies each dataset from the original files into the new file under its original name. Each copied object has the same type as in the source HDF5 file. In this case, they are PyTables Arrays (because all columns are the same type). Because the datasets keep their source names, every dataset must have a different name. The data is not merged into a single dataset.

with tb.open_file('table_copy_2.h5', mode='w') as h5fw:
    for h5name in glob.glob('file*.h5'):
        # Open each source read-only; the with block closes it after copying
        with tb.open_file(h5name, mode='r') as h5fr:
            print(h5fr.root._v_children)
            h5fr.root._f_copy_children(h5fw.root)
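
If the source files reuse the same dataset name, one possible workaround is to copy each file into its own group. This is a sketch, not part of the original examples. The function name copy_into_groups and the group names src_0, src_1, ... are hypothetical.

```python
import tables as tb

def copy_into_groups(h5names, out_name):
    """Copy each source file's contents into its own group of the output file."""
    with tb.open_file(out_name, mode='w') as h5fw:
        for idx, h5name in enumerate(h5names):
            # 'src_0', 'src_1', ... are illustrative names; the title records the source
            dest = h5fw.create_group('/', 'src_' + str(idx), title=h5name)
            with tb.open_file(h5name, mode='r') as h5fr:
                # recursive=True also copies any nested groups
                h5fr.root._f_copy_children(dest, recursive=True)
```

The datasets then appear as /src_0/<name>, /src_1/<name>, and so on, so identical names no longer collide.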

Method 3a: Merge all data into 1 Array
This copies and merges the data from each dataset in the original files into a single dataset in the new file. Again, the data is saved as a PyTables Array. There are no restrictions on the dataset names. First I read the data from each file and append it to a NumPy array. Once all files have been processed, the NumPy array is written to the PyTables Array. This process holds the whole NumPy array in memory, so it may not work for large datasets. You can avoid this limitation by using a PyTables EArray (Enlargeable Array). See Method 3b.

with tb.open_file('table_merge_2a.h5', mode='w') as h5fw:
    all_data = None
    for h5name in glob.glob('file*.h5'):
        with tb.open_file(h5name, mode='r') as h5fr:
            dset1 = h5fr.root._f_list_nodes()[0]   # the only dataset in the file
            arr_data = dset1[:]                    # read it into memory
        if all_data is None:
            all_data = arr_data.copy()
        else:
            all_data = np.append(all_data, arr_data, axis=0)
    # Write the merged NumPy array as a single PyTables Array
    h5fw.create_array(h5fw.root, 'alldata', obj=all_data)
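
np.append() copies the accumulated array on every call. A variant that collects the pieces in a list and concatenates once avoids those repeated copies, although everything must still fit in memory. This is a sketch under the same one-dataset-per-file assumption. The function name merge_to_array is hypothetical.

```python
import numpy as np
import tables as tb

def merge_to_array(h5names, out_name):
    """Stack the first dataset of each source file into one PyTables Array."""
    blocks = []
    for h5name in h5names:
        with tb.open_file(h5name, mode='r') as h5fr:
            blocks.append(h5fr.root._f_list_nodes()[0][:])
    # Single concatenation along the row axis instead of one copy per file
    all_data = np.concatenate(blocks, axis=0)
    with tb.open_file(out_name, mode='w') as h5fw:
        h5fw.create_array('/', 'alldata', obj=all_data)
    return all_data.shape
```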

Method 3b: Merge all data into 1 EArray (Enlargeable Array)
This is similar to the method above, but saves the data incrementally in a PyTables EArray. The EArray.append() method is used to add the data. This reduces the memory footprint compared with Method 3a, because only one source dataset is held in memory at a time.

with tb.open_file('table_merge_2b.h5', mode='w') as h5fw:
    earr = None
    for h5name in glob.glob('file*.h5'):
        with tb.open_file(h5name, mode='r') as h5fr:
            dset1 = h5fr.root._f_list_nodes()[0]
            arr_data = dset1[:]
        if earr is None:
            # First file: create the EArray (first dimension enlargeable) and store its data
            earr = h5fw.create_earray(h5fw.root, 'alldata',
                                      shape=(0, arr_data.shape[1]), obj=arr_data)
        else:
            earr.append(arr_data)
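
When even a single source dataset is too large to read in one go, the same EArray.append() idea can be applied to slices of the source. The sketch below assumes one dataset per file and matching column counts. The function name merge_in_blocks and the block_rows default are hypothetical.

```python
import tables as tb

def merge_in_blocks(h5names, out_name, block_rows=100_000):
    """Append the first dataset of each source file to one EArray, block by block."""
    with tb.open_file(out_name, mode='w') as h5fw:
        earr = None
        for h5name in h5names:
            with tb.open_file(h5name, mode='r') as h5fr:
                dset = h5fr.root._f_list_nodes()[0]
                if earr is None:
                    # Empty EArray with the source dtype; the first dimension can grow
                    tail = tuple(int(n) for n in dset.shape[1:])
                    earr = h5fw.create_earray('/', 'alldata',
                                              atom=tb.Atom.from_dtype(dset.dtype),
                                              shape=(0,) + tail)
                # Only block_rows rows are held in memory at any time
                for start in range(0, int(dset.shape[0]), block_rows):
                    earr.append(dset[start:start + block_rows])
        return int(earr.nrows) if earr is not None else 0
```

A smaller block_rows lowers peak memory at the cost of more read and write calls.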

Method 4: Merge all data into 1 Table
This example highlights a difference between h5py and PyTables. In h5py, a dataset can hold either np.ndarray or np.recarray data; h5py deals with the different dtypes. In PyTables, Arrays (and CArrays and EArrays) hold np.ndarray data, and Tables hold np.recarray data. This example shows how to convert the np.ndarray data from the source files into np.recarray data suitable for Table objects. It also shows how to use Table.append(), which works like EArray.append() in Method 3b.

with tb.open_file('table_append_2.h5', mode='w') as h5fw:
    # Field names and types for the Table rows (five float columns here)
    ds_dt = [('f1', float), ('f2', float), ('f3', float), ('f4', float), ('f5', float)]
    data_table = None
    for h5name in glob.glob('file*.h5'):
        with tb.open_file(h5name, mode='r') as h5fr:
            dset1 = h5fr.root._f_list_nodes()[0]
            arr_data = dset1[:]
        recarr_data = np.rec.array(arr_data, dtype=ds_dt)
        if data_table is None:
            data_table = h5fw.create_table('/', 'alldata', obj=recarr_data)
        else:
            data_table.append(recarr_data)
        h5fw.flush()
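
The field list above is hardcoded for five float columns. As an alternative sketch (not the conversion used above), NumPy's unstructured_to_structured() can build the record dtype from the column count. The function name to_records and the f1, f2, ... naming are assumptions made for illustration.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

def to_records(arr_2d):
    """Convert an (N, ncols) array to N records with fields f1..f<ncols>."""
    ncols = arr_2d.shape[1]
    # One field per column, all sharing the source array's dtype
    rec_dt = np.dtype([('f' + str(i + 1), arr_2d.dtype) for i in range(ncols)])
    return rfn.unstructured_to_structured(arr_2d, rec_dt)
```

The result is a one-dimensional structured array that can be passed to create_table(..., obj=...) and Table.append() in the same way as recarr_data.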

kcw78 answered Sep 17 '25 19:09