Statistics of multiple similarly named columns

Question

I have a huge dataset with multiple columns like x1, x2, x3......x25, y1, y2, y3......y50, z1, z2.......z10 etc. which looks something like this:

x1  x2  x3  x4  y1  y2  y3  
1   2   1   2   1   1   2   
2   1   1   1   3   1   1
1   2   2   1   1   2   1

What I want is:

x_mean  x_min  x_max  x_mad  y_mean  y_min  y_max  y_mad
  1.5     1      2     0.74    2       1      2      0 
  1.25    1      2       0     2       1      2      0 
  1.5     1      2     0.74    2       1      2      0

Basically, I need to calculate min, max, mad (median absolute deviation), and mean

> x_mean = (x1+x2+x3+x4)/4 = (1+2+1+2)/4

and similarly for other rows and other stats. How do I do this in R, preferably in dplyr?

alistaire · Accepted Answer

Generally you can use the scoped variants of summarise like summarise_all, which in their funs helper function can accept any number of summary functions. In your case, you should reshape to long form first so as to make your data tidy (here move observations from columns to rows) and therefore make your analysis simpler:

library(tidyverse)

df <- read.table(text = 'x1  x2  x3  x4  y1  y2  y3  
1   2   1   2   1   1   2   
2   1   1   1   3   1   1
1   2   2   1   1   2   1', head = TRUE)

df_tidy <- df %>% 
    mutate(row = row_number()) %>%       # keep position info
    gather(var, val, -row) %>%           # reshape to long
    mutate(var = sub('\d', '', var))    # extract letters from former colnames

df_summary <- df_tidy %>%
    group_by(var, row) %>%                # group by variable and original row
    summarise_all(funs(min, max, mad))    # summarize with various functions

df_summary
#> # A tibble: 6 x 5
#> # Groups:   var [?]
#>     var   row   min   max    mad
#>   <chr> <int> <dbl> <dbl>  <dbl>
#> 1     x     1     1     2 0.7413
#> 2     x     2     1     2 0.0000
#> 3     x     3     1     2 0.7413
#> 4     y     1     1     2 0.0000
#> 5     y     2     1     3 0.0000
#> 6     y     3     1     2 0.0000

You can reshape this back to wide if you like, but for anything save presentation purposes I'd advise against it.

Sotos · Answer

Not as elegant as the tidyverse method, but still a base R option,

#create a function to do whatever calculations needed,
f1 <- function(d){
  mean1 <- rowMeans(d)
  min1 <- do.call(pmin, d)
  max1 <- do.call(pmax, d)
  mad1 <- apply(d, 1, mad)
  return(data.frame(mean1, min1, max1, mad1))
}

#apply it to your data frame based on the unique name letters,
ind <- unique(sub('\d+', '', names(df)))
setNames(lapply(ind, function(i) f1(df[grepl(i, names(df))])), ind)
#$x
#  mean1 min1 max1   mad1
#1  1.50    1    2 0.7413
#2  1.25    1    2 0.0000
#3  1.50    1    2 0.7413

#$y
#     mean1 min1 max1 mad1
#1 1.333333    1    2    0
#2 1.666667    1    3    0
#3 1.333333    1    2    0

You can always convert to your expected output by,

do.call(cbind, setNames(lapply(ind, function(i) f1(df[grepl(i, names(df))])), ind))
#  x.mean1 x.min1 x.max1 x.mad1  y.mean1 y.min1 y.max1 y.mad1
#1    1.50      1      2 0.7413 1.333333      1      2      0
#2    1.25      1      2 0.0000 1.666667      1      3      0
#3    1.50      1      2 0.7413 1.333333      1      2      0

Statistics of multiple similarly named columns

Tags:

r

dplyr

statistics

ultron

2 Answers

alistaire

Sotos

Recent Activity

Donate For Us

Statistics of multiple similarly named columns

Tags:

r

dplyr

statistics

ultron

2 Answers

alistaire

Sotos

Related questions

Recent Activity

Donate For Us