
Setting DataFrame values with enlargement

Tags: python, pandas

I have two DataFrames (with DatetimeIndex) and want to update the first frame (the older one) with data from the second frame (the newer one).

The new frame may contain more recent data for rows already contained in the old frame. In this case, data in the old frame should be overwritten with data from the new frame. The newer frame may also have more columns or rows than the first one. In this case, the old frame should be enlarged by the data in the new frame.

The pandas docs state that

"The .loc/.ix/[] operations can perform enlargement when setting a non-existant key for that axis"

and

"a DataFrame can be enlarged on either axis via .loc"

However, this doesn't seem to work and throws a KeyError. Example:

In [195]: df1
Out[195]: 
                     A  B  C
2015-07-09 12:00:00  1  1  1
2015-07-09 13:00:00  1  1  1
2015-07-09 14:00:00  1  1  1
2015-07-09 15:00:00  1  1  1

In [196]: df2
Out[196]: 
                     A  B  C  D
2015-07-09 14:00:00  2  2  2  2
2015-07-09 15:00:00  2  2  2  2
2015-07-09 16:00:00  2  2  2  2
2015-07-09 17:00:00  2  2  2  2

In [197]: df1.loc[df2.index] = df2
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-197-74e630e87cf8> in <module>()
----> 1 df1.loc[df2.index] = df2

/.../pandas/core/indexing.pyc in __setitem__(self, key, value)
    112 
    113     def __setitem__(self, key, value):
--> 114         indexer = self._get_setitem_indexer(key)
    115         self._setitem_with_indexer(indexer, value)
    116 

/.../pandas/core/indexing.pyc in _get_setitem_indexer(self, key)
    107 
    108         try:
--> 109             return self._convert_to_indexer(key, is_setter=True)
    110         except TypeError:
    111             raise IndexingError(key)

/.../pandas/core/indexing.pyc in _convert_to_indexer(self, obj, axis, is_setter)
   1110                 mask = check == -1
   1111                 if mask.any():
-> 1112                     raise KeyError('%s not in index' % objarr[mask])
   1113 
   1114                 return _values_from_object(indexer)

KeyError: "['2015-07-09T18:00:00.000000000+0200' '2015-07-09T19:00:00.000000000+0200'] not in index"

What is the best way (with respect to performance, as my real data is much larger) to achieve the desired updated and enlarged DataFrame? This is the result I would like to see:

                     A  B  C    D
2015-07-09 12:00:00  1  1  1  NaN
2015-07-09 13:00:00  1  1  1  NaN
2015-07-09 14:00:00  2  2  2    2
2015-07-09 15:00:00  2  2  2    2
2015-07-09 16:00:00  2  2  2    2
2015-07-09 17:00:00  2  2  2    2
bmu asked Jul 09 '15 14:07




3 Answers

df2.combine_first(df1) seems to meet your requirement. Code snippet and output below:

import pandas as pd

print('pandas-version: ', pd.__version__)

# older frame: columns A-C, 12:00 to 15:00
df1 = pd.DataFrame.from_records([('2015-07-09 12:00:00',1,1,1),
                                 ('2015-07-09 13:00:00',1,1,1),
                                 ('2015-07-09 14:00:00',1,1,1),
                                 ('2015-07-09 15:00:00',1,1,1)],
                                columns=['Dt', 'A', 'B', 'C']).set_index('Dt')
# print(df1)

# newer frame: extra column D, 14:00 to 17:00
df2 = pd.DataFrame.from_records([('2015-07-09 14:00:00',2,2,2,2),
                                 ('2015-07-09 15:00:00',2,2,2,2),
                                 ('2015-07-09 16:00:00',2,2,2,2),
                                 ('2015-07-09 17:00:00',2,2,2,2),],
                               columns=['Dt', 'A', 'B', 'C', 'D']).set_index('Dt')
# values from df2 take priority; missing entries are filled from df1
res_combine1st = df2.combine_first(df1)
print(res_combine1st)

output

pandas-version:  0.15.2
                     A  B  C   D
Dt                              
2015-07-09 12:00:00  1  1  1 NaN
2015-07-09 13:00:00  1  1  1 NaN
2015-07-09 14:00:00  2  2  2   2
2015-07-09 15:00:00  2  2  2   2
2015-07-09 16:00:00  2  2  2   2
2015-07-09 17:00:00  2  2  2   2
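
The snippet above indexes by plain strings. As a minimal sketch of the same combine_first idea with a real DatetimeIndex, as in the question, something like the following should work. The frame construction and the "60min" frequency string are my own assumptions for illustration, not taken from the answer.

```python
import pandas as pd

# Assumed sample data mirroring the question: hourly DatetimeIndex, overlapping ranges.
idx1 = pd.date_range("2015-07-09 12:00:00", periods=4, freq="60min")
idx2 = pd.date_range("2015-07-09 14:00:00", periods=4, freq="60min")
df1 = pd.DataFrame(1, index=idx1, columns=["A", "B", "C"])       # older frame
df2 = pd.DataFrame(2, index=idx2, columns=["A", "B", "C", "D"])  # newer frame

# Calling combine_first on the newer frame gives its values priority;
# rows and columns that exist only in df1 are taken from df1.
result = df2.combine_first(df1)
print(result)
```

The result covers the union of both indexes and both column sets, so the rows that exist only in df1 have NaN in column D.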
Joshua Baboo answered Nov 15 '22 20:11


You can use the combine function.

import numpy as np
import pandas as pd

# your data
# ===========================================================
df1 = pd.DataFrame(np.ones(12).reshape(4,3), columns='A B C'.split(), index=pd.date_range('2015-07-09 12:00:00', periods=4, freq='H'))

df2 = pd.DataFrame(np.ones(16).reshape(4,4)*2, columns='A B C D'.split(), index=pd.date_range('2015-07-09 14:00:00', periods=4, freq='H'))

# processing
# =====================================================
# reindex to populate NaN
result = df2.reindex(np.union1d(df1.index, df2.index))

Out[248]: 
                      A   B   C   D
2015-07-09 12:00:00 NaN NaN NaN NaN
2015-07-09 13:00:00 NaN NaN NaN NaN
2015-07-09 14:00:00   2   2   2   2
2015-07-09 15:00:00   2   2   2   2
2015-07-09 16:00:00   2   2   2   2
2015-07-09 17:00:00   2   2   2   2

combiner = lambda x, y: np.where(x.isnull(), y, x)

# use df1 to update result
result.combine(df1, combiner)

Out[249]: 
                     A  B  C   D
2015-07-09 12:00:00  1  1  1 NaN
2015-07-09 13:00:00  1  1  1 NaN
2015-07-09 14:00:00  2  2  2   2
2015-07-09 15:00:00  2  2  2   2
2015-07-09 16:00:00  2  2  2   2
2015-07-09 17:00:00  2  2  2   2

# maybe fillna(method='ffill') if you like
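
A rough variant of the same reindex-then-combine idea, written without a custom combiner, might look like the sketch below. It swaps in Index.union and DataFrame.where for np.union1d and combine; that substitution, the variable names, and the "60min" frequency string are my assumptions, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Assumed sample data mirroring the answer above.
df1 = pd.DataFrame(np.ones((4, 3)), columns=list("ABC"),
                   index=pd.date_range("2015-07-09 12:00:00", periods=4, freq="60min"))
df2 = pd.DataFrame(np.ones((4, 4)) * 2, columns=list("ABCD"),
                   index=pd.date_range("2015-07-09 14:00:00", periods=4, freq="60min"))

# Bring both frames onto the union of rows and columns; missing cells become NaN.
full_index = df1.index.union(df2.index)
full_columns = df1.columns.union(df2.columns)
new = df2.reindex(index=full_index, columns=full_columns)
old = df1.reindex(index=full_index, columns=full_columns)

# Keep the newer value where it exists, otherwise fall back to the older one.
result = new.where(new.notna(), old)
print(result)
```

Reindexing both frames first makes the alignment explicit, so the final step is a plain element-wise choice between two frames of identical shape.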
Jianxun Li answered Nov 15 '22 20:11


In addition to the previous answer: after reindexing, you can use

result.fillna(df1, inplace=True)

So, based on Jianxun Li's code (extended with one more column), you can try this:

import numpy as np
import pandas as pd

# your data
# ===========================================================
df1 = pd.DataFrame(np.ones(12).reshape(4,3), columns='A B C'.split(), index=pd.date_range('2015-07-09 12:00:00', periods=4, freq='H'))
df2 = pd.DataFrame(np.ones(20).reshape(4,5)*2, columns='A B C D E'.split(), index=pd.date_range('2015-07-09 14:00:00', periods=4, freq='H'))

# processing
# =====================================================
# reindex to populate NaN
result = df2.reindex(np.union1d(df1.index, df2.index))
# fill NaN from df1
result.fillna(df1, inplace=True)

Out[3]:             
                     A  B  C   D   E
2015-07-09 12:00:00  1  1  1 NaN NaN
2015-07-09 13:00:00  1  1  1 NaN NaN
2015-07-09 14:00:00  2  2  2   2   2
2015-07-09 15:00:00  2  2  2   2   2
2015-07-09 16:00:00  2  2  2   2   2
2015-07-09 17:00:00  2  2  2   2   2
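
The question is phrased as updating the old frame, so one more way to express the reindex-then-fill pattern is to enlarge df1 first and then overwrite it with DataFrame.update. This is a sketch of a different technique from the answers in this thread; the sample data, the variable name enlarged, and the "60min" frequency string are assumptions for illustration, and I have not measured how it performs on large data.

```python
import numpy as np
import pandas as pd

# Assumed sample data mirroring the question (float values so NaN fits without a dtype change).
df1 = pd.DataFrame(np.ones((4, 3)), columns=list("ABC"),
                   index=pd.date_range("2015-07-09 12:00:00", periods=4, freq="60min"))
df2 = pd.DataFrame(np.ones((4, 4)) * 2, columns=list("ABCD"),
                   index=pd.date_range("2015-07-09 14:00:00", periods=4, freq="60min"))

# Step 1: enlarge the old frame to the union of rows and columns (new cells are NaN).
enlarged = df1.reindex(index=df1.index.union(df2.index),
                       columns=df1.columns.union(df2.columns))

# Step 2: overwrite in place with every non-NaN value from the newer frame.
# update() aligns on labels and returns None.
enlarged.update(df2)
print(enlarged)
```

Note that update only writes into labels that already exist in the calling frame, which is why the reindex step has to come first.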
herbico answered Nov 15 '22 19:11