Merge DataFrames in Pandas using the mean

I have a set of DataFrames with numeric values and partly overlapping indices. I would like to merge them and take the mean if an index occurs in more than one DataFrame.

import pandas as pd
import numpy as np

df1 = pd.DataFrame([1,2,3], columns=['col'], index=['a','b','c'])
df2 = pd.DataFrame([4,5,6], columns=['col'], index=['b','c','d'])

This gives me two DataFrames:

   col            col
a    1        b     4
b    2        c     5
c    3        d     6

Now I would like to merge the DataFrames and take the mean for each index (if applicable, i.e. if it occurs more than once).

Should look like this:

    col
a     1
b     3
c     4
d     6

Can I do this with some advanced merging/joining?

Martin Preusse asked Oct 21 '13 08:10


2 Answers

Something like this:

df3 = pd.concat((df1, df2))
df3.groupby(df3.index).mean()

#    col
# a    1
# b    3
# c    4
# d    6

Or the other way around, as in @unutbu's answer:

pd.concat((df1, df2), axis=1).mean(axis=1)
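
Since the question mentions a set of DataFrames rather than exactly two, the concat-then-groupby approach can be wrapped in a small helper. This is only a minimal sketch: the name `mean_merge` and the third frame `df_extra` are made up for illustration and are not part of the original answer.

```python
import pandas as pd


def mean_merge(frames):
    """Stack the frames row-wise, then average rows sharing an index label."""
    stacked = pd.concat(frames)
    # level=0 groups by the (single-level) index itself
    return stacked.groupby(level=0).mean()


df1 = pd.DataFrame([1, 2, 3], columns=['col'], index=['a', 'b', 'c'])
df2 = pd.DataFrame([4, 5, 6], columns=['col'], index=['b', 'c', 'd'])
# Hypothetical third frame, added only to show that the helper generalizes
df_extra = pd.DataFrame([7, 8], columns=['col'], index=['c', 'e'])

result = mean_merge([df1, df2, df_extra])
print(result)
```

Note that the `axis=1` variant returns a Series and averages across every column of the concatenated frame, so it matches this result only when each input has a single column.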
Roman Pekar answered Oct 18 '22 13:10


In [22]: pd.merge(df1, df2, left_index=True, right_index=True, how='outer').mean(axis=1)
Out[22]: 
a    1
b    3
c    4
d    6
dtype: float64

Regarding Roman's question, I find IPython's %timeit command a convenient way to benchmark code:

In [28]: %timeit df3 = pd.concat((df1, df2)); df3.groupby(df3.index).mean()
1000 loops, best of 3: 617 µs per loop

In [29]: %timeit pd.merge(df1, df2, left_index=True, right_index=True, how='outer').mean(axis=1)
1000 loops, best of 3: 577 µs per loop

In [39]: %timeit pd.concat((df1, df2), axis=1).mean(axis=1)
1000 loops, best of 3: 524 µs per loop

In this case, pd.concat(...).mean(...) turns out to be a bit faster. But really we should test bigger dataframes to get a more meaningful benchmark.

By the way, if you do not want to install IPython, equivalent benchmarks can be run using Python's timeit module. It just takes a bit more setup. The docs have some examples showing how to do this.
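
As a rough sketch of what that setup could look like, the snippet below times the three approaches with the standard-library `timeit` module on larger frames. The frame size `n`, the half-overlapping integer indices, the random data, and the wrapper function names are all assumptions chosen for illustration; timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed sizes: two frames of n rows whose integer indices overlap by half.
n = 10_000
rng = np.random.default_rng(0)
big1 = pd.DataFrame({'col': rng.random(n)}, index=np.arange(n))
big2 = pd.DataFrame({'col': rng.random(n)}, index=np.arange(n // 2, n // 2 + n))


def concat_groupby():
    stacked = pd.concat((big1, big2))
    return stacked.groupby(stacked.index).mean()


def merge_outer():
    merged = pd.merge(big1, big2, left_index=True, right_index=True, how='outer')
    return merged.mean(axis=1)


def concat_columns():
    return pd.concat((big1, big2), axis=1).mean(axis=1)


runs = 10
timings = {}
for func in (concat_groupby, merge_outer, concat_columns):
    # repeat() returns one total per repetition; take the smallest as the least noisy
    best = min(timeit.repeat(func, number=runs, repeat=3))
    timings[func.__name__] = best / runs
    print(f"{func.__name__}: {timings[func.__name__] * 1e3:.3f} ms per call")
```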


Note that if df1 or df2 were to have duplicate entries in its index, for example like this:

N = 1000
df1 = pd.DataFrame([1,2,3]*N, columns=['col'], index=['a','b','c']*N)
df2 = pd.DataFrame([4,5,6]*N, columns=['col'], index=['b','c','d']*N)

then these three answers give different results:

In [56]: df3 = pd.concat((df1, df2)); df3.groupby(df3.index).mean()
Out[56]: 
   col
a    1
b    3
c    4
d    6

pd.merge probably does not give the kind of answer you want:

In [58]: len(pd.merge(df1, df2, left_index=True, right_index=True, how='outer').mean(axis=1))
Out[58]: 2002000

While pd.concat((df1, df2), axis=1) raises a ValueError:

In [48]: pd.concat((df1, df2), axis=1)
ValueError: cannot reindex from a duplicate axis
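
One possible way to make the column-wise approach work with duplicate labels is to collapse the duplicates inside each frame first, so every index is unique before aligning. This is a sketch under the assumption that averaging within each frame first is acceptable; the names `u1` and `u2` are made up here.

```python
import pandas as pd

N = 1000
df1 = pd.DataFrame([1, 2, 3] * N, columns=['col'], index=['a', 'b', 'c'] * N)
df2 = pd.DataFrame([4, 5, 6] * N, columns=['col'], index=['b', 'c', 'd'] * N)

# Collapse duplicate labels inside each frame so every index is unique
u1 = df1.groupby(level=0).mean()
u2 = df2.groupby(level=0).mean()

# With unique indices, the column-wise concat can align rows by label
result = pd.concat((u1, u2), axis=1).mean(axis=1)
print(result)
```

Keep in mind that this computes a mean of per-frame means. It agrees with the concat-then-groupby result only when each label occurs the same number of times in every frame that contains it, as it does in this example.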
unutbu answered Oct 18 '22 11:10