Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R ave by columns

Tags:

dataframe

r

I want to use the ave function on many columns (tens) on the data frame:

ave(df[,the_cols], df[,c('site', 'month')], FUN = mean)

The problem is that ave runs the mean function on all the_cols columns together. Is there any way to run it for each of the_cols columns separately?

I tried to look at the other functions. tapply and aggregate are different, they return only one row per group. I need the ave behaviour, i.e. to return the same number of rows as given in the original df. There is also a by function, but using it would be very clumsy as it returns a complicated list structure that would have to be converted somehow.

Certainly many clumsy and ugly (by & do.call, multiple *apply function calls etc.) solutions exist but is there some really easy and elegant?

like image 224
Tomas Avatar asked Jan 24 '14 19:01

Tomas


People also ask

What is the average value in the first row in R?

The average value in the first row across the first and third columns is 1. The average value in the second row across the first and third columns is 7. And so on. You can find more R tutorials here.

What is the Ave function in Excel?

In contrast, the ave function returns the output once per input value. There’s more to learn about the ave function! So keep on reading… Example 2 shows how to use the ave function to calculate averages by group.

How to get average of values in same row but different columns?

Getting the average of values in the same row but different columns is a tool for understanding the relationship if one exists. Explanation of the function. The rowMeans ()average function finds the average numeric vector of a dataframe or other multi-column data set, like an array or a matrix.

Does Col4 ignore the NAS in the first 3 columns?

The only minimally tricky aspect is that some columns contain NAs. I want to create a Col4 that averages the entries in the first 3 columns, ignoring the NAs. So: but it doesn't. Show activity on this post. Show activity on this post. Why NOT the accepted answer?


3 Answers

Perhaps I'm missing something, but an apply() approach here would work very well and wouldn't be ugly or require any ugly hacks. Some dummy data:

df <- data.frame(A = rnorm(20), B = rnorm(20), site = gl(5,4), month = gl(10, 2))

what is wrong with:

sapply(df[, c("A","B")], ave, df$site, df$month)

? Coerce that to a data frame via data.frame() if you really want that.

R> sapply(df[, c("A","B")], ave, df$site, df$month)
            A        B
 [1,]  0.0775  0.04845
 [2,]  0.0775  0.04845
 [3,] -1.5563  0.43443
 [4,] -1.5563  0.43443
 [5,]  0.7193  0.01151
 [6,]  0.7193  0.01151
 [7,] -0.9243 -0.28483
 [8,] -0.9243 -0.28483
 [9,]  0.3316  0.14473
[10,]  0.3316  0.14473
[11,] -0.2539  0.20384
[12,] -0.2539  0.20384
[13,]  0.5558 -0.37239
[14,]  0.5558 -0.37239
[15,]  0.1976 -0.22693
[16,]  0.1976 -0.22693
[17,]  0.2031  1.11041
[18,]  0.2031  1.11041
[19,]  0.3229 -0.53818
[20,]  0.3229 -0.53818

Putting it together a bit more, how about

AVE <- function(df, cols, ...) {
  dots <- list(...)
  out <- sapply(df[, cols], ave, ...)
  out <- data.frame(as.data.frame(dots), out)
  names(out) <- c(paste0("Fac", seq_along(dots)), cols)
  out
}

R> AVE(df, c("A","B"), df$site, df$month)
   Fac1 Fac2       A        B
1     1    1  0.0775  0.04845
2     1    1  0.0775  0.04845
3     1    2 -1.5563  0.43443
4     1    2 -1.5563  0.43443
5     2    3  0.7193  0.01151
6     2    3  0.7193  0.01151
7     2    4 -0.9243 -0.28483
8     2    4 -0.9243 -0.28483
9     3    5  0.3316  0.14473
10    3    5  0.3316  0.14473
11    3    6 -0.2539  0.20384
12    3    6 -0.2539  0.20384
13    4    7  0.5558 -0.37239
14    4    7  0.5558 -0.37239
15    4    8  0.1976 -0.22693
16    4    8  0.1976 -0.22693
17    5    9  0.2031  1.11041
18    5    9  0.2031  1.11041
19    5   10  0.3229 -0.53818
20    5   10  0.3229 -0.53818

The details of working with ... escape me at the moment, but you should be able to get better names for the Fac1 etc that I used here.

I'll throw an alternative representation out there for you: aggregate() but use the ave() function instead of mean():

R> aggregate(cbind(A, B) ~ site + month, data = df, ave)
   site month     A.1     A.2      B.1      B.2
1     1     1  0.0775  0.0775  0.04845  0.04845
2     1     2 -1.5563 -1.5563  0.43443  0.43443
3     2     3  0.7193  0.7193  0.01151  0.01151
4     2     4 -0.9243 -0.9243 -0.28483 -0.28483
5     3     5  0.3316  0.3316  0.14473  0.14473
6     3     6 -0.2539 -0.2539  0.20384  0.20384
7     4     7  0.5558  0.5558 -0.37239 -0.37239
8     4     8  0.1976  0.1976 -0.22693 -0.22693
9     5     9  0.2031  0.2031  1.11041  1.11041
10    5    10  0.3229  0.3229 -0.53818 -0.53818

Note quite the stated output, but it is something that is simple to reshape if needed.

like image 117
Gavin Simpson Avatar answered Oct 13 '22 11:10

Gavin Simpson


If you want to have a data.frame back

library(plyr)
## assuming that the_cols are string
## if col index just add the index of site and month
the_cols <- c("site", "month", the_cols)
ddply(df, c('site', 'month'), FUN = numcolwise(mean))[,the_cols]
like image 33
dickoa Avatar answered Oct 13 '22 13:10

dickoa


You can use by with colMeans

by(df[,the_cols], df[,c('site', 'month')], FUN = colMeans)

You can also use ave inside lapply:

res <- lapply(df[,the_cols], function(x) 
                               ave(x, df[,c('site', 'month')], FUN = mean))

data.frame(res) # create data frame
like image 3
Sven Hohenstein Avatar answered Oct 13 '22 12:10

Sven Hohenstein