Merge DataFrames in Pandas using the mean

Tags: python, merge, pandas

I have a set of DataFrames with numeric values and partly overlapping indices. I would like to merge them and take the mean if an index occurs in more than one DataFrame.

import pandas as pd
import numpy as np

df1 = pd.DataFrame([1,2,3], columns=['col'], index=['a','b','c'])
df2 = pd.DataFrame([4,5,6], columns=['col'], index=['b','c','d'])

This gives me two DataFrames:

   col            col
a    1        b     4
b    2        c     5
c    3        d     6

Now I would like to merge the DataFrames and take the mean for each index (if applicable, i.e. if it occurs more than once).

Should look like this:

   col
a    1
b    3
c    4
d    6

Can I do this with some advanced merging/joining?

asked Oct 21 '13 08:10

Martin Preusse

2 Answers

Something like this:

df3 = pd.concat((df1, df2))
df3.groupby(df3.index).mean()

#    col
# a    1
# b    3
# c    4
# d    6

Or the other way around, as in @unutbu's answer:

pd.concat((df1, df2), axis=1).mean(axis=1)
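For reference, a self-contained sketch of both variants on the question's `df1` and `df2`. The names `stacked`, `by_group`, `side_by_side`, and `by_row` are only illustrative and do not come from the answers.

```python
import pandas as pd

df1 = pd.DataFrame([1, 2, 3], columns=['col'], index=['a', 'b', 'c'])
df2 = pd.DataFrame([4, 5, 6], columns=['col'], index=['b', 'c', 'd'])

# Variant 1: stack the rows, then average all rows sharing an index label.
stacked = pd.concat((df1, df2))
by_group = stacked.groupby(stacked.index).mean()  # DataFrame with column 'col'

# Variant 2: align the frames side by side (NaN where a label is missing),
# then average across the columns. mean() skips NaN by default.
side_by_side = pd.concat((df1, df2), axis=1)
by_row = side_by_side.mean(axis=1)  # Series indexed by label

print(by_group)
print(by_row)
```

Note that the first variant returns a DataFrame while the second returns a Series, so which one fits better may depend on what the result feeds into.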

answered Oct 18 '22 13:10

Roman Pekar

In [22]: pd.merge(df1, df2, left_index=True, right_index=True, how='outer').mean(axis=1)
Out[22]: 
a    1
b    3
c    4
d    6
dtype: float64
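To see why the row-wise mean works here, it may help to look at the intermediate frame. This is a small sketch on the question's data; the variable names `merged` and `result` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([1, 2, 3], columns=['col'], index=['a', 'b', 'c'])
df2 = pd.DataFrame([4, 5, 6], columns=['col'], index=['b', 'c', 'd'])

# Outer merge on the index keeps every label from both frames.
# The clashing column name 'col' gets pandas' default suffixes:
# 'col_x' (from df1) and 'col_y' (from df2). A label missing from
# one frame gets NaN in that frame's column.
merged = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')
print(merged)

# mean(axis=1) skips NaN by default, so 'a' and 'd' keep their single value.
result = merged.mean(axis=1)
print(result)
```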

Regarding Roman's question, I find IPython's %timeit command a convenient way to benchmark code:

In [28]: %timeit df3 = pd.concat((df1, df2)); df3.groupby(df3.index).mean()
1000 loops, best of 3: 617 µs per loop

In [29]: %timeit pd.merge(df1, df2, left_index=True, right_index=True, how='outer').mean(axis=1)
1000 loops, best of 3: 577 µs per loop

In [39]: %timeit pd.concat((df1, df2), axis=1).mean(axis=1)
1000 loops, best of 3: 524 µs per loop

In this case, pd.concat(...).mean(...) turns out to be a bit faster. But really we should test bigger DataFrames to get a more meaningful benchmark.

By the way, if you do not want to install IPython, equivalent benchmarks can be run using Python's timeit module. It just takes a bit more setup. The docs have some examples showing how to do this.
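A minimal sketch of such a benchmark with the standard-library `timeit` module, run on larger frames. The frame size, the random data, and the helper names (`concat_groupby`, `merge_mean`, `concat_mean`, `timings`) are assumptions made for illustration; absolute timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger inputs: 10,000 rows each, with half of the labels shared.
n = 10_000
rng = np.random.default_rng(0)
df1 = pd.DataFrame({'col': rng.random(n)}, index=np.arange(n))
df2 = pd.DataFrame({'col': rng.random(n)}, index=np.arange(n // 2, n // 2 + n))


def concat_groupby():
    df3 = pd.concat((df1, df2))
    return df3.groupby(df3.index).mean()


def merge_mean():
    merged = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')
    return merged.mean(axis=1)


def concat_mean():
    return pd.concat((df1, df2), axis=1).mean(axis=1)


number = 10  # calls per timing run
timings = {}
for func in (concat_groupby, merge_mean, concat_mean):
    # repeat() returns the total time of each run; keep the best run
    # and divide by the number of calls to get a per-call figure.
    best = min(timeit.repeat(func, number=number, repeat=3)) / number
    timings[func.__name__] = best
    print(f'{func.__name__}: {best * 1e3:.3f} ms per loop')
```

Taking the minimum over several runs mirrors the "best of 3" figure that `%timeit` reported above.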

Note that if df1 or df2 were to have duplicate entries in its index, for example like this:

N = 1000
df1 = pd.DataFrame([1,2,3]*N, columns=['col'], index=['a','b','c']*N)
df2 = pd.DataFrame([4,5,6]*N, columns=['col'], index=['b','c','d']*N)

then these three answers give different results:

In [56]: df3 = pd.concat((df1, df2)); df3.groupby(df3.index).mean()
Out[56]: 
   col
a    1
b    3
c    4
d    6

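pd.merge probably does not give the kind of answer you want, because the merge pairs every 'b' (and 'c') row of df1 with every matching row of df2, giving 2 * 1000 * 1000 rows plus 1000 each for 'a' and 'd':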

In [58]: len(pd.merge(df1, df2, left_index=True, right_index=True, how='outer').mean(axis=1))
Out[58]: 2002000

While pd.concat((df1, df2), axis=1) raises a ValueError:

In [48]: pd.concat((df1, df2), axis=1)
ValueError: cannot reindex from a duplicate axis
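One possible way to keep the `axis=1` variant usable with duplicated labels, offered as a sketch and not taken from the answers above, is to collapse each frame to one row per label first. This computes a mean of per-frame means, which matches the pooled `groupby` mean only when a label occurs equally often in every frame that contains it (as it does in this example). The names `pooled`, `dedup`, and `mean_of_means` are illustrative.

```python
import pandas as pd

N = 1000
df1 = pd.DataFrame([1, 2, 3] * N, columns=['col'], index=['a', 'b', 'c'] * N)
df2 = pd.DataFrame([4, 5, 6] * N, columns=['col'], index=['b', 'c', 'd'] * N)

# Pooled mean over all rows sharing a label; duplicate labels are fine here.
df3 = pd.concat((df1, df2))
pooled = df3.groupby(df3.index).mean()['col']

# Collapse each frame to one row per label so the indices are unique,
# then align side by side and average across the columns.
dedup = [df.groupby(level=0).mean() for df in (df1, df2)]
mean_of_means = pd.concat(dedup, axis=1).mean(axis=1)

print(pooled)
print(mean_of_means)
```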

answered Oct 18 '22 11:10

unutbu
