
Get the mean across multiple Pandas DataFrames

I'm generating a number of dataframes with the same shape, and I want to compare them to one another. I want to be able to get the mean and median across the dataframes.

             Source.0  Source.1  Source.2  Source.3
    cluster
    0        0.001182  0.184535  0.814230  0.000054
    1        0.000001  0.160490  0.839508  0.000001
    2        0.000001  0.173829  0.826114  0.000055
    3        0.000432  0.180065  0.819502  0.000001
    4        0.000152  0.157041  0.842694  0.000113
    5        0.000183  0.174142  0.825674  0.000001
    6        0.000001  0.151556  0.848405  0.000038
    7        0.000771  0.177583  0.821645  0.000001
    8        0.000001  0.202059  0.797939  0.000001
    9        0.000025  0.189537  0.810410  0.000028
    10       0.006142  0.003041  0.493912  0.496905
    11       0.003739  0.002367  0.514216  0.479678
    12       0.002334  0.001517  0.529041  0.467108
    13       0.003458  0.000001  0.532265  0.464276
    14       0.000405  0.005655  0.527576  0.466364
    15       0.002557  0.003233  0.507954  0.486256
    16       0.004161  0.000001  0.491271  0.504568
    17       0.001364  0.001330  0.528311  0.468996
    18       0.002886  0.000001  0.506392  0.490721
    19       0.001823  0.002498  0.509620  0.486059

             Source.0  Source.1  Source.2  Source.3
    cluster
    0        0.000001  0.197108  0.802495  0.000396
    1        0.000001  0.157860  0.842076  0.000063
    2        0.094956  0.203057  0.701662  0.000325
    3        0.000001  0.181948  0.817841  0.000210
    4        0.000003  0.169680  0.830316  0.000001
    5        0.000362  0.177194  0.822443  0.000001
    6        0.000001  0.146807  0.852924  0.000268
    7        0.001087  0.178994  0.819564  0.000354
    8        0.000001  0.202182  0.797333  0.000485
    9        0.000348  0.181399  0.818252  0.000001
    10       0.003050  0.000247  0.506777  0.489926
    11       0.004420  0.000001  0.513927  0.481652
    12       0.006488  0.001396  0.527197  0.464919
    13       0.001510  0.000001  0.525987  0.472502
    14       0.000001  0.000001  0.520737  0.479261
    15       0.000001  0.001765  0.515658  0.482575
    16       0.000001  0.000001  0.492550  0.507448
    17       0.002855  0.000199  0.526535  0.470411
    18       0.000001  0.001952  0.498303  0.499744
    19       0.001232  0.000001  0.506612  0.492155

Then I want to get the mean of these two dataframes.

What is the easiest way to do this?

Just to clarify: I want the mean for each individual cell, given that the indexes and columns of all the dataframes are exactly the same.

So in the example I gave, the average for [0,Source.0] would be (0.001182 + 0.000001) / 2 = 0.0005915.

Tim asked Jul 31 '14 11:07


1 Answer

Assuming the two dataframes have the same columns, you could just concatenate them and compute your summary stats on the concatenated frames:

    import numpy as np
    import pandas as pd

    # some random data frames
    df1 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
    df2 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))

    # concatenate them
    df_concat = pd.concat((df1, df2))

    # the data are random, so the exact numbers will differ from run to run
    print(df_concat.mean())
    # x   -0.163044
    # y    2.120000
    # dtype: float64

    print(df_concat.median())
    # x   -0.192037
    # y    2.000000
    # dtype: float64

Update

If you want to compute stats across each set of rows with the same index in the two datasets, you can use .groupby() to group the data by row index, then apply the mean, median etc.:

    # group rows from both frames that share the same index label
    by_row_index = df_concat.groupby(df_concat.index)
    df_means = by_row_index.mean()

    print(df_means.head())
    #           x    y
    # 0 -0.850794  1.5
    # 1  0.159038  1.5
    # 2  0.083278  1.0
    # 3 -0.540336  0.5
    # 4  0.390954  3.5
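As a quick check against the numbers in the question, here is a minimal sketch that applies the same group-by-index idea to the first two rows of each posted frame. The names `df_a`, `df_b`, `frames`, `cell_means` and `cell_medians` are illustrative and not from the original post, and `groupby(level=0)` is used here as an equivalent way of grouping on the row index.

```python
import pandas as pd

# First two rows (clusters 0 and 1) of each frame posted in the question
cols = ["Source.0", "Source.1", "Source.2", "Source.3"]
df_a = pd.DataFrame(
    [[0.001182, 0.184535, 0.814230, 0.000054],
     [0.000001, 0.160490, 0.839508, 0.000001]],
    index=pd.Index([0, 1], name="cluster"), columns=cols)
df_b = pd.DataFrame(
    [[0.000001, 0.197108, 0.802495, 0.000396],
     [0.000001, 0.157860, 0.842076, 0.000063]],
    index=pd.Index([0, 1], name="cluster"), columns=cols)

frames = [df_a, df_b]  # the list may hold any number of identically labelled frames

# Stack the frames vertically, then group the rows that share an index label
grouped = pd.concat(frames).groupby(level=0)
cell_means = grouped.mean()      # one mean per (cluster, column) cell
cell_medians = grouped.median()  # one median per (cluster, column) cell

print(cell_means.loc[0, "Source.0"])  # approximately 0.0005915
```

The result has the same index and columns as the inputs, so `cell_means.loc[0, "Source.0"]` matches the (0.001182 + 0.000001) / 2 value worked out in the question, up to floating-point rounding.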

This method also works when your dataframes have unequal numbers of rows: if a particular row index is missing from one of the two dataframes, the mean/median is computed on the single existing row.
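To illustrate the unequal-rows case, here is a small sketch with made-up data. The frames `df_full` and `df_short` are hypothetical toy inputs, chosen only so that the result is easy to verify by hand.

```python
import pandas as pd

# Hypothetical toy frames: df_short has no row labelled 2
df_full = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=[0, 1, 2])
df_short = pd.DataFrame({"x": [3.0, 6.0]}, index=[0, 1])

# Group the stacked rows by index label and average each group
means = pd.concat([df_full, df_short]).groupby(level=0).mean()

print(means)
#      x
# 0  2.0
# 1  4.0
# 2  3.0
```

Row 2 exists only in `df_full`, so its "mean" is just the single value 3.0. If a missing row should produce NaN instead, this approach will not do that on its own, since it averages whatever rows are present.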

ali_m answered Sep 20 '22 17:09