I'm generating a number of dataframes with the same shape, and I want to compare them to one another. I want to be able to get the mean and median across the dataframes.
             Source.0  Source.1  Source.2  Source.3
    cluster
    0        0.001182  0.184535  0.814230  0.000054
    1        0.000001  0.160490  0.839508  0.000001
    2        0.000001  0.173829  0.826114  0.000055
    3        0.000432  0.180065  0.819502  0.000001
    4        0.000152  0.157041  0.842694  0.000113
    5        0.000183  0.174142  0.825674  0.000001
    6        0.000001  0.151556  0.848405  0.000038
    7        0.000771  0.177583  0.821645  0.000001
    8        0.000001  0.202059  0.797939  0.000001
    9        0.000025  0.189537  0.810410  0.000028
    10       0.006142  0.003041  0.493912  0.496905
    11       0.003739  0.002367  0.514216  0.479678
    12       0.002334  0.001517  0.529041  0.467108
    13       0.003458  0.000001  0.532265  0.464276
    14       0.000405  0.005655  0.527576  0.466364
    15       0.002557  0.003233  0.507954  0.486256
    16       0.004161  0.000001  0.491271  0.504568
    17       0.001364  0.001330  0.528311  0.468996
    18       0.002886  0.000001  0.506392  0.490721
    19       0.001823  0.002498  0.509620  0.486059

             Source.0  Source.1  Source.2  Source.3
    cluster
    0        0.000001  0.197108  0.802495  0.000396
    1        0.000001  0.157860  0.842076  0.000063
    2        0.094956  0.203057  0.701662  0.000325
    3        0.000001  0.181948  0.817841  0.000210
    4        0.000003  0.169680  0.830316  0.000001
    5        0.000362  0.177194  0.822443  0.000001
    6        0.000001  0.146807  0.852924  0.000268
    7        0.001087  0.178994  0.819564  0.000354
    8        0.000001  0.202182  0.797333  0.000485
    9        0.000348  0.181399  0.818252  0.000001
    10       0.003050  0.000247  0.506777  0.489926
    11       0.004420  0.000001  0.513927  0.481652
    12       0.006488  0.001396  0.527197  0.464919
    13       0.001510  0.000001  0.525987  0.472502
    14       0.000001  0.000001  0.520737  0.479261
    15       0.000001  0.001765  0.515658  0.482575
    16       0.000001  0.000001  0.492550  0.507448
    17       0.002855  0.000199  0.526535  0.470411
    18       0.000001  0.001952  0.498303  0.499744
    19       0.001232  0.000001  0.506612  0.492155
Then I want to get the mean of these two dataframes.
What is the easiest way to do this?
Just to clarify, I want to get the mean for each particular cell when the indexes and columns of all the dataframes are exactly the same.
So in the example I gave, the average for [0, Source.0] would be (0.001182 + 0.000001) / 2 = 0.0005915.
Assuming the two dataframes have the same columns, you could just concatenate them and compute your summary stats on the concatenated frames:
    import numpy as np
    import pandas as pd

    # some random data frames
    df1 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
    df2 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))

    # concatenate them
    df_concat = pd.concat((df1, df2))

    print(df_concat.mean())
    # x   -0.163044
    # y    2.120000
    # dtype: float64

    print(df_concat.median())
    # x   -0.192037
    # y    2.000000
    # dtype: float64
If you want to compute stats across each set of rows with the same index in the two datasets, you can use .groupby() to group the data by row index, then apply the mean, median etc.:
    # group rows that share the same index value, then average within each group
    by_row_index = df_concat.groupby(df_concat.index)
    df_means = by_row_index.mean()

    print(df_means.head())
    #           x    y
    # 0 -0.850794  1.5
    # 1  0.159038  1.5
    # 2  0.083278  1.0
    # 3 -0.540336  0.5
    # 4  0.390954  3.5
This method will work even when your dataframes have unequal numbers of rows: if a particular row index is missing from one of the dataframes, the mean/median for that index is computed from the rows that do exist.
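As a quick check against the numbers in the question, here is a minimal sketch using only the first two clusters and the first two columns of the posted frames. The names `df_a`, `df_b`, `cell_means` and `cell_medians` are made up for illustration, and cluster 1 is deliberately left out of the second frame to show the unequal-rows case.

```python
import pandas as pd

# First two clusters of the first frame in the question (two columns only).
df_a = pd.DataFrame(
    {"Source.0": [0.001182, 0.000001], "Source.1": [0.184535, 0.160490]},
    index=pd.Index([0, 1], name="cluster"),
)

# Second frame, with cluster 1 deliberately omitted (illustrative assumption).
df_b = pd.DataFrame(
    {"Source.0": [0.000001], "Source.1": [0.197108]},
    index=pd.Index([0], name="cluster"),
)

# Stack the frames vertically, then group rows that share a cluster label.
stacked = pd.concat([df_a, df_b])
by_cluster = stacked.groupby(level=0)

cell_means = by_cluster.mean()      # one mean per (cluster, column) cell
cell_medians = by_cluster.median()  # one median per (cluster, column) cell

print(cell_means)
print(cell_medians)
```

The value at `cell_means.loc[0, "Source.0"]` should agree with the hand-computed 0.0005915 from the question (up to floating-point rounding), while cluster 1 simply keeps the values from the one frame that contains it. The same pattern should extend to any number of frames by passing a longer list to `pd.concat`.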