This is something that I'm not quite sure on from the docs.
Say I have two dataframes, with overlapping data.
DF1 has a DateTimeIndex starting at 07:00:00 and ending at 09:30:00.
DF2 has a DateTimeIndex starting at 07:00:00 and ending at 11:30:00.
DF2 is an updated DF1, but some rows may be added as compared to DF1 before the end time of DF1. So DF2 might have 200 rows added from 9:20:00-9:30:00 on the update, and then everything after 09:30:00 also is new of course.
If I use:
DF1.append(DF2)
Will I get all the rows from DF2 that are new? Or does pandas only go in and take the rows from after the end of DF1? Beyond that, DF2 additions could actually be at the same Time as a DF1 row but it would have different contents. Would pandas handle this as well?
If pandas doesn't handle this, what is the best way to do this myself?
In [489]: df
Out[489]:
Row1 Row3
2013-11-05 08:00:00 2 NaN
2013-11-05 09:00:00 4 NaN
2013-11-05 09:06:00 6 5
In [490]: df2
Out[490]:
Row1 Row3
2013-11-05 08:00:00 2 NaN
2013-11-05 09:00:00 5 NaN
2013-11-05 09:09:00 6 5
In [491]: df.append(df2)
Out[491]:
Row1 Row3
2013-11-05 08:00:00 2 NaN
2013-11-05 09:00:00 4 NaN
2013-11-05 09:06:00 6 5
2013-11-05 08:00:00 2 NaN
2013-11-05 09:00:00 5 NaN
2013-11-05 09:09:00 6 5
I want df.append(df2) in this case to be:
In [491]: df.append(df2)
Out[491]:
Row1 Row3
2013-11-05 08:00:00 2 NaN
2013-11-05 09:00:00 4 NaN
2013-11-05 09:06:00 6 5
<strike>2013-11-05 08:00:00 2 NaN</strike>
2013-11-05 09:00:00 5 NaN
2013-11-05 09:09:00 6 5
edit2:
I was previously doing this:
last = df.ix[-1].name
to_append = df2[last:]
new_df = df.append(to_append)
This unfortunately cuts out the rows that are new but before the timestamp of the last row of my previous DataFrame
Append is similar to one of python list, you will get both dataframes "stacked" together. Whether a ValueError is raised in case of indexes with duplicates is controlled by verify_integrity param to append, defaulting to False.
>>> df = pd.DataFrame.from_dict({'col':{'row': 1}})
>>> df
col
row 1
>>> df.append(df).index
Index([u'row', u'row'], dtype=object)
>>> df.append(df)
col
row 1
row 1
>>> df.append(df, verify_integrity=True)
Traceback (most recent call last):
...
ValueError: Indexes have overlapping values: ['row']
for merging with replacement use combine_first:
>>> mdf = pd.DataFrame.from_dict({'col':{'row': 2, 'new':3}})
>>> df.combine_first(mdf) # values from df overwrite those of mdf
col
new 3
row 1
>>> mdf.combine_first(df) # values from mdf overwrite those of df
col
new 3
row 2
For the reference, here is extensive guide on different merge and join means of dataframes
UPDATE
Followup: as you wish behaviour similar to SQL union, one approach migth be:
>>> df = pd.DataFrame.from_dict({'col':{'row': 1, 'new': 3}})
>>> mdf.append(df).drop_duplicates()
col
new 3
row 2
row 1
Or if you want to take index in account,
>>> mdf['index'] = mdf.index
>>> df['index'] = df.index
>>> union = mdf.append(df).drop_duplicates()
>>> del union['index']
>>> union
col
new 3
row 2
row 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With