I have a set of DataFrames with numeric values and partly overlapping indices. I would like to merge them and take the mean if an index occurs in more than one DataFrame.
import pandas as pd
import numpy as np
df1 = pd.DataFrame([1,2,3], columns=['col'], index=['a','b','c'])
df2 = pd.DataFrame([4,5,6], columns=['col'], index=['b','c','d'])
This gives me two DataFrames:
df1:
col
a 1
b 2
c 3
df2:
col
b 4
c 5
d 6
Now I would like to merge the DataFrames and take the mean for each index (if applicable, i.e. if it occurs more than once).
Should look like this:
col
a 1
b 3
c 4
d 6
Can I do this with some advanced merging/joining?
The concat() function concatenates DataFrames along rows or columns, for example by appending the rows of one to the other. The merge() function is the equivalent of the SQL JOIN clause; 'left', 'right', 'inner' and 'outer' joins are all possible.
To merge two pandas DataFrames on a common column, call merge() and pass the column name as the on parameter.
merge() is available both as a top-level function (pd.merge) and as a DataFrame method. join() is a DataFrame method only, and it joins on the index by default: DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False). If both DataFrames share a column name, lsuffix and rsuffix must be given to tell the columns apart.
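Applied to the two DataFrames from the question, an index-based join might look like the sketch below. The suffixes '_1' and '_2' are arbitrary labels picked for illustration; they are only needed because both frames have a column named 'col'.

```python
import pandas as pd

df1 = pd.DataFrame([1, 2, 3], columns=['col'], index=['a', 'b', 'c'])
df2 = pd.DataFrame([4, 5, 6], columns=['col'], index=['b', 'c', 'd'])

# Both frames have a column named 'col', so join() needs suffixes to
# tell them apart; '_1' and '_2' are arbitrary labels chosen here.
joined = df1.join(df2, how='outer', lsuffix='_1', rsuffix='_2')

# Row-wise mean; NaN entries (index present in only one frame) are skipped.
result = joined.mean(axis=1)
print(result)
```

Like the merge-based approach, this assumes each index label occurs at most once per DataFrame.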
The merge() function returns a DataFrame of the two merged objects. When the two DataFrames have different numbers of elements, how='outer' returns all elements of df1 and df2 and fills in NaN wherever one of them has no value for a key.
merge() also takes an indicator parameter, which can be set to True or False. When it is True, the result gets an extra _merge column recording whether each row came from the left DataFrame only, the right DataFrame only, or both.
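As a rough illustration of an outer merge with the indicator switched on, here is a sketch using the question's df1 and df2. The suffixes '_1' and '_2' are assumptions made for readability; pandas would otherwise default to '_x' and '_y'.

```python
import pandas as pd

df1 = pd.DataFrame([1, 2, 3], columns=['col'], index=['a', 'b', 'c'])
df2 = pd.DataFrame([4, 5, 6], columns=['col'], index=['b', 'c', 'd'])

# Outer merge on the index; indicator=True adds a '_merge' column that
# says where each row came from ('left_only', 'right_only' or 'both').
merged = pd.merge(
    df1, df2,
    left_index=True, right_index=True,
    how='outer', suffixes=('_1', '_2'), indicator=True,
)
print(merged)

# Double square brackets select the two value columns as a DataFrame,
# leaving '_merge' out of the row-wise mean.
means = merged[['col_1', 'col_2']].mean(axis=1)
print(means)
```

The _merge column is a quick way to see which index labels were actually averaged and which were simply carried over from one side.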
Note that double square brackets are needed to select several columns at once. Finally, to get the mean of every column in a pandas DataFrame, simply call the .mean() method on the entire DataFrame.
Something like this:
df3 = pd.concat((df1, df2))
df3.groupby(df3.index).mean()
# col
# a 1
# b 3
# c 4
# d 6
Or the other way around, as in @unutbu's answer:
pd.concat((df1, df2), axis=1).mean(axis=1)
In [22]: pd.merge(df1, df2, left_index=True, right_index=True, how='outer').mean(axis=1)
Out[23]:
a 1
b 3
c 4
d 6
dtype: float64
Regarding Roman's question, I find IPython's %timeit command a convenient way to benchmark code:
In [28]: %timeit df3 = pd.concat((df1, df2)); df3.groupby(df3.index).mean()
1000 loops, best of 3: 617 µs per loop
In [29]: %timeit pd.merge(df1, df2, left_index=True, right_index=True, how='outer').mean(axis=1)
1000 loops, best of 3: 577 µs per loop
In [39]: %timeit pd.concat((df1, df2), axis=1).mean(axis=1)
1000 loops, best of 3: 524 µs per loop
In this case, pd.concat(...).mean(...) turns out to be a bit faster. But really we should test bigger DataFrames to get a more meaningful benchmark.
By the way, if you do not want to install IPython, equivalent benchmarks can be run using Python's timeit module. It just takes a bit more setup. The docs have some examples showing how to do this.
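For reference, a timeit-module version of the three benchmarks might look roughly like this. The loop count (number = 100) and the dictionary labels are arbitrary choices for this sketch, and the timings will of course differ by machine and pandas version.

```python
import timeit

# Code that runs once before timing: build the two small DataFrames.
setup = """
import pandas as pd
df1 = pd.DataFrame([1, 2, 3], columns=['col'], index=['a', 'b', 'c'])
df2 = pd.DataFrame([4, 5, 6], columns=['col'], index=['b', 'c', 'd'])
"""

# The three candidate statements, keyed by a descriptive label.
statements = {
    'concat + groupby': "df3 = pd.concat((df1, df2)); df3.groupby(df3.index).mean()",
    'merge + mean': "pd.merge(df1, df2, left_index=True, right_index=True, how='outer').mean(axis=1)",
    'concat(axis=1) + mean': "pd.concat((df1, df2), axis=1).mean(axis=1)",
}

number = 100  # executions per timing run
timings = {}
for label, stmt in statements.items():
    # repeat() returns one total time per run; the minimum is the least noisy.
    best = min(timeit.repeat(stmt, setup=setup, repeat=3, number=number))
    timings[label] = best / number
    print(f"{label}: {timings[label] * 1e6:.0f} µs per loop")
```

Taking the best of several runs mirrors what %timeit reports ("best of 3").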
Note that if df1 or df2 were to have duplicate entries in its index, for example like this:
N = 1000
df1 = pd.DataFrame([1,2,3]*N, columns=['col'], index=['a','b','c']*N)
df2 = pd.DataFrame([4,5,6]*N, columns=['col'], index=['b','c','d']*N)
then these three answers give different results:
In [56]: df3 = pd.concat((df1, df2)); df3.groupby(df3.index).mean()
Out[56]:
col
a 1
b 3
c 4
d 6
pd.merge probably does not give the kind of answer you want:
In [58]: len(pd.merge(df1, df2, left_index=True, right_index=True, how='outer').mean(axis=1))
Out[58]: 2002000
While pd.concat((df1, df2), axis=1) raises a ValueError:
In [48]: pd.concat((df1, df2), axis=1)
ValueError: cannot reindex from a duplicate axis
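Given that, the concat-then-groupby approach is the one that keeps working when an index has duplicates. A minimal sketch that wraps it in a helper follows; mean_by_index is a hypothetical name introduced here, not a pandas function.

```python
import pandas as pd


def mean_by_index(*frames):
    """Stack the frames row-wise and average all rows sharing an index label.

    Hypothetical helper for illustration; not part of pandas.
    """
    stacked = pd.concat(frames)
    # level=0 groups by the (single-level) index; keys come back sorted.
    return stacked.groupby(level=0).mean()


N = 1000
df1 = pd.DataFrame([1, 2, 3] * N, columns=['col'], index=['a', 'b', 'c'] * N)
df2 = pd.DataFrame([4, 5, 6] * N, columns=['col'], index=['b', 'c', 'd'] * N)

result = mean_by_index(df1, df2)
print(result)
```

One caveat: with duplicates, every row counts equally, so a label that appears more often in one DataFrame pulls the mean toward that DataFrame's values. If a mean of per-DataFrame means is wanted instead, each frame would need to be aggregated by index first.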