Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can I update an HDFStore?

Tags:

Consider the following hdfstore and dataframes df and df2

import pandas as pd

store = pd.HDFStore('test.h5')

midx = pd.MultiIndex.from_product([range(2), list('XYZ')], names=list('AB'))
df = pd.DataFrame(dict(C=range(6)), midx)

df

     C
A B   
0 X  0
  Y  1
  Z  2
1 X  3
  Y  4
  Z  5

midx2 = pd.MultiIndex.from_product([range(2), list('VWX')], names=list('AB'))
df2 = pd.DataFrame(dict(C=range(6)), midx2)

df2

     C
A B   
0 V  0
  W  1
  X  2
1 V  3
  W  4
  X  5

I want to first write df to the store.

store.append('df', df)

store.get('df')

     C
A B   
0 X  0
  Y  1
  Z  2
1 X  3
  Y  4
  Z  5

At a later point in time I will have another dataframe that I want to update the store with. I want to overwrite the rows with the same index values as are in my new dataframe while keeping the old ones.

When I do

store.append('df', df2)

store.get('df')

     C
A B   
0 X  0
  Y  1
  Z  2
1 X  3
  Y  4
  Z  5
0 V  0
  W  1
  X  2
1 V  3
  W  4
  X  5

This isn't at all what I want. Notice that (0, 'X') and (1, 'X') are repeated. I can manipulate the combined dataframe and overwrite, but I expect to be working with a lot data where this wouldn't be feasible.

How do I update the store to get?

     C
A B   
0 V  0
  W  1
  X  2
  Y  1
  Z  2
1 V  3
  W  4
  X  5
  Y  4
  Z  5

You'll see that For each level of 'A', 'Y' and 'Z' are the same, 'V' and 'W' are new, and 'X' is updated.

What is the correct way to do this?

like image 985
piRSquared Avatar asked Aug 11 '17 18:08

piRSquared


1 Answers

Idea: remove matching rows (with matching index values) from the HDF first and then append df2 to HDFStore.

Problem: I couldn't find a way to use where="index in df2.index" for multi-index indexes.

Solution: first convert multiindexes to normal ones:

df.index = df.index.get_level_values(0).astype(str) + '_' + df.index.get_level_values(1).astype(str)

df2.index = df2.index.get_level_values(0).astype(str) + '_' + df2.index.get_level_values(1).astype(str)

this yields:

In [348]: df
Out[348]:
     C
0_X  0
0_Y  1
0_Z  2
1_X  3
1_Y  4
1_Z  5

In [349]: df2
Out[349]:
     C
0_V  0
0_W  1
0_X  2
1_V  3
1_W  4
1_X  5

make sure that you use format='t' and data_columns=True (this will index save index and index all columns in the HDF5 file, allowing us to use them in the where clause) when you create/append HDF5 files:

store = pd.HDFStore('d:/temp/test1.h5')
store.append('df', df, format='t', data_columns=True)
store.close()

now we can first remove those rows from the HDFStore with matching indexes:

store = pd.HDFStore('d:/temp/test1.h5')

In [345]: store.remove('df', where="index in df2.index")
Out[345]: 2

and append df2:

In [346]: store.append('df', df2, format='t', data_columns=True, append=True)

Result:

In [347]: store.get('df')
Out[347]:
     C
0_Y  1
0_Z  2
1_Y  4
1_Z  5
0_V  0
0_W  1
0_X  2
1_V  3
1_W  4
1_X  5
like image 150
MaxU - stop WAR against UA Avatar answered Sep 22 '22 20:09

MaxU - stop WAR against UA