Read specific columns from an HDF5 file and apply conditions

I want to read only specific columns from an HDF5 file and apply conditions to those columns. My concern is that I don't want to load the entire HDF5 file into memory as a DataFrame; I want to fetch only the columns I need, filtered by their conditions.

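import os
import pandas as pd

columns = ['col1', 'col2']
condition = "col2 == 1"
groupname = '/path/to/group'  # HDF5 keys use forward slashes
Hdf5File = os.path.join('path', 'to', 'hdf5.h5')
# HDFStore takes no format argument; the format is set when the data is written
with pd.HDFStore(Hdf5File, mode='r') as store:
    if groupname in store:
        df = pd.read_hdf(store, key=groupname, columns=columns, where=[condition])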

I get this error:

TypeError: cannot pass a column specification when reading a Fixed format store. this store must be selected in its entirety

Then I tried the line below, which returns only the specified columns:

df=store[groupname][columns]

But I don't know how to apply a condition to it.

Safariba asked Jul 03 '17 09:07


1 Answer

To read an HDF5 file conditionally, the data must be saved in table format, and the columns used in the condition must be stored as data columns (the data_columns argument), which makes them indexed and queryable.

Demo:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(100,5), columns=list('abcde'))
df.to_hdf('c:/temp/file.h5', key='df_key', format='table', data_columns=True)

In [10]: pd.read_hdf('c:/temp/file.h5', 'df_key', where="a > 0.5 and a < 0.75")
Out[10]:
           a         b         c         d         e
3   0.744123  0.515697  0.005335  0.017147  0.176254
5   0.555202  0.074128  0.874943  0.660555  0.776340
6   0.667145  0.278355  0.661728  0.705750  0.623682
8   0.701163  0.429860  0.223079  0.735633  0.476182
14  0.645130  0.302878  0.428298  0.969632  0.983690
15  0.633334  0.898632  0.881866  0.228983  0.216519
16  0.535633  0.906661  0.221823  0.608291  0.330101
17  0.715708  0.478515  0.002676  0.231314  0.075967
18  0.587762  0.262281  0.458854  0.811845  0.921100
21  0.551251  0.537855  0.906546  0.169346  0.063612
..       ...       ...       ...       ...       ...
68  0.610958  0.874373  0.785681  0.147954  0.966443
72  0.619666  0.818202  0.378740  0.416452  0.903129
73  0.500782  0.536064  0.697678  0.654602  0.054445
74  0.638659  0.518900  0.210444  0.308874  0.604929
76  0.696883  0.601130  0.402640  0.150834  0.264218
77  0.692149  0.963457  0.364050  0.152215  0.622544
85  0.737854  0.055863  0.346940  0.003907  0.678405
91  0.644924  0.840488  0.151190  0.566749  0.181861
93  0.710590  0.900474  0.061603  0.144200  0.946062
95  0.601144  0.288909  0.074561  0.615098  0.737097

[33 rows x 5 columns]
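
To come back to the question's actual goal (a subset of columns plus a condition), columns= and where= can be combined in a single read_hdf call once the data is stored in table format. The following is a minimal sketch only: the temporary file location and the column names col1, col2 and col3 are assumptions made up for illustration, not taken from the asker's file.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical demo data; the path and the column names are assumptions.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
df = pd.DataFrame({
    "col1": np.arange(10) * 0.5,
    "col2": [0, 1] * 5,
    "col3": np.arange(10) * 2.0,
})

# Table format plus data_columns makes col2 queryable with where=.
df.to_hdf(path, key="df_key", format="table", data_columns=["col2"])

# Rows are filtered by the where= condition; only the requested columns are returned.
subset = pd.read_hdf(path, key="df_key", columns=["col1", "col2"], where="col2 == 1")
print(subset)
```

Only columns listed in data_columns (or the index) can appear in where=, so every column used in a condition has to be declared when the file is written.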

UPDATE:

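If you can't change the HDF5 file, consider reading it in chunks and filtering each chunk, so that the whole file is never held in memory at once. Note that chunksize still requires the table format; a fixed-format store can only be read in its entirety: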

In [13]: df = pd.concat([x.query("0.5 < a < 0.75")
                         for x in pd.read_hdf('c:/temp/file.h5', 'df_key', chunksize=10)],
                        ignore_index=True)

In [14]: df
Out[14]:
           a         b         c         d         e
0   0.744123  0.515697  0.005335  0.017147  0.176254
1   0.555202  0.074128  0.874943  0.660555  0.776340
2   0.667145  0.278355  0.661728  0.705750  0.623682
3   0.701163  0.429860  0.223079  0.735633  0.476182
4   0.645130  0.302878  0.428298  0.969632  0.983690
5   0.633334  0.898632  0.881866  0.228983  0.216519
6   0.535633  0.906661  0.221823  0.608291  0.330101
7   0.715708  0.478515  0.002676  0.231314  0.075967
8   0.587762  0.262281  0.458854  0.811845  0.921100
9   0.551251  0.537855  0.906546  0.169346  0.063612
..       ...       ...       ...       ...       ...
23  0.610958  0.874373  0.785681  0.147954  0.966443
24  0.619666  0.818202  0.378740  0.416452  0.903129
25  0.500782  0.536064  0.697678  0.654602  0.054445
26  0.638659  0.518900  0.210444  0.308874  0.604929
27  0.696883  0.601130  0.402640  0.150834  0.264218
28  0.692149  0.963457  0.364050  0.152215  0.622544
29  0.737854  0.055863  0.346940  0.003907  0.678405
30  0.644924  0.840488  0.151190  0.566749  0.181861
31  0.710590  0.900474  0.061603  0.144200  0.946062
32  0.601144  0.288909  0.074561  0.615098  0.737097

[33 rows x 5 columns]
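
The chunked approach can also be wrapped in a small helper that checks the storage format first and trims the result to the wanted columns. This is a rough sketch under stated assumptions: filter_in_chunks is a hypothetical name (not a pandas function), and the demo file is a temporary one created just for the example.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def filter_in_chunks(path, key, query, columns=None, chunksize=10_000):
    """Hypothetical helper: read a table-format key chunk by chunk, keeping matching rows."""
    with pd.HDFStore(path, mode="r") as store:
        # Chunked reads are only supported for table-format nodes.
        if not store.get_storer(key).is_table:
            raise TypeError(f"{key!r} is in fixed format; chunked reads need table format")
        # Filter each chunk in memory so only one chunk is loaded at a time.
        parts = [chunk.query(query) for chunk in store.select(key, chunksize=chunksize)]
    result = pd.concat(parts, ignore_index=True)
    return result[columns] if columns is not None else result


# Demo on a temporary file written WITHOUT data_columns (assumed setup).
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((100, 5)), columns=list("abcde"))
demo.to_hdf(path, key="df_key", format="table")

out = filter_in_chunks(path, "df_key", "0.5 < a < 0.75", columns=["a", "b"], chunksize=10)
print(out)
```

Smaller chunks lower peak memory use at the cost of more read calls; the condition is evaluated by pandas in memory here, not by PyTables on disk.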
MaxU - stop WAR against UA answered Nov 11 '22 05:11