Consider the following hdfstore
and dataframes df
and df2
import pandas as pd
store = pd.HDFStore('test.h5')
midx = pd.MultiIndex.from_product([range(2), list('XYZ')], names=list('AB'))
df = pd.DataFrame(dict(C=range(6)), midx)
df
C
A B
0 X 0
Y 1
Z 2
1 X 3
Y 4
Z 5
midx2 = pd.MultiIndex.from_product([range(2), list('VWX')], names=list('AB'))
df2 = pd.DataFrame(dict(C=range(6)), midx2)
df2
C
A B
0 V 0
W 1
X 2
1 V 3
W 4
X 5
I want to first write df
to the store.
store.append('df', df)
store.get('df')
C
A B
0 X 0
Y 1
Z 2
1 X 3
Y 4
Z 5
At a later point in time I will have another dataframe that I want to update the store with. I want to overwrite the rows with the same index values as are in my new dataframe while keeping the old ones.
When I do
store.append('df', df2)
store.get('df')
C
A B
0 X 0
Y 1
Z 2
1 X 3
Y 4
Z 5
0 V 0
W 1
X 2
1 V 3
W 4
X 5
This isn't at all what I want. Notice that (0, 'X')
and (1, 'X')
are repeated. I can manipulate the combined dataframe and overwrite, but I expect to be working with a lot data where this wouldn't be feasible.
How do I update the store to get?
C
A B
0 V 0
W 1
X 2
Y 1
Z 2
1 V 3
W 4
X 5
Y 4
Z 5
You'll see that For each level of 'A'
, 'Y'
and 'Z'
are the same, 'V'
and 'W'
are new, and 'X'
is updated.
What is the correct way to do this?
Idea: remove matching rows (with matching index values) from the HDF first and then append df2
to HDFStore.
Problem: I couldn't find a way to use where="index in df2.index"
for multi-index indexes.
Solution: first convert multiindexes to normal ones:
df.index = df.index.get_level_values(0).astype(str) + '_' + df.index.get_level_values(1).astype(str)
df2.index = df2.index.get_level_values(0).astype(str) + '_' + df2.index.get_level_values(1).astype(str)
this yields:
In [348]: df
Out[348]:
C
0_X 0
0_Y 1
0_Z 2
1_X 3
1_Y 4
1_Z 5
In [349]: df2
Out[349]:
C
0_V 0
0_W 1
0_X 2
1_V 3
1_W 4
1_X 5
make sure that you use format='t'
and data_columns=True
(this will index save index and index all columns in the HDF5 file, allowing us to use them in the where
clause) when you create/append HDF5 files:
store = pd.HDFStore('d:/temp/test1.h5')
store.append('df', df, format='t', data_columns=True)
store.close()
now we can first remove those rows from the HDFStore with matching indexes:
store = pd.HDFStore('d:/temp/test1.h5')
In [345]: store.remove('df', where="index in df2.index")
Out[345]: 2
and append df2
:
In [346]: store.append('df', df2, format='t', data_columns=True, append=True)
Result:
In [347]: store.get('df')
Out[347]:
C
0_Y 1
0_Z 2
1_Y 4
1_Z 5
0_V 0
0_W 1
0_X 2
1_V 3
1_W 4
1_X 5
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With