Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to calculate the mean of those columns in a data frame with the same column name

Tags:

r

I have a data frame consisting of 10299 observations of 66 variables. Some of these variables share a common column name, and I would like to calculate the mean of those variables for each observation.

Having the following matrix, with column names c(A, B, C, B, A ,C):

A B C B A C                             
1 2 3 4 5 6         
3 5 6 7 4 3                             
3 3 3 3 5 5                             
2 2 2 2 2 2

I would like to get:

A   B   C    
3   3   4.5
3.5 6   4.5
4   3   4
2   2   2

I tried for loops, the command aggregate() but I don't get the desired result.

Sorry if the question seems too basic, I have checked google for possible solutions but I didn't find any.

like image 868
user2500444 Avatar asked Nov 30 '22 01:11

user2500444


1 Answers

Here's a solution.

First let's define an exemplary data.frame (the same as in your example).

df <- as.data.frame(
    matrix(c(1,3,3,2,2,5,3,2,3,6,3,2,4,7,3,2,5,4,5,2,6,3,5,2),
        ncol=6,
        dimnames=list(NULL, c("A", "B", "C", "B", "A", "C"))
    )
  )

Below we apply a custom function on each unique column name, col: it selects all the columns named col and calculates rowMeans. The result, list of atomic vectors, will be coerced to a data.frame:

res <- as.data.frame( # sapply returns a list here, so we convert it to a data.frame
    sapply(unique(names(df)), # for each unique column name
       function(col) rowMeans(df[names(df) == col]) # calculate row means
    )
  )

The result:

res
##     A B   C
## 1 3.0 3 4.5
## 2 3.5 6 4.5
## 3 4.0 3 4.0
## 4 2.0 2 2.0

EDIT: As there are many solutions proposed already, let's benchmark them:

set.seed(123)
df <- as.data.frame(matrix(sample(1:9, replace=TRUE, 10000*100),
   dimnames=list(NULL, sample(LETTERS[1:5], 100, replace=TRUE)), ncol=100))
library(microbenchmark)
microbenchmark(...)
## Unit: milliseconds
##                   min         lq     median         uq        max neval
## @gagolews   61.196075   65.73211   77.22533  119.42028  127.32557    10
## @joran       8.297964   10.05242   10.90564   15.25943   65.69156    10
## @Davide   5535.272680 5731.24220 5754.67006 5808.47807 5862.22628    10

The clear winner (at least as far as speed is concerned) is @joran's lapply+split+Reduce. Congrats! :-)

like image 91
gagolews Avatar answered Dec 05 '22 02:12

gagolews