Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Row Means based on Column Substring

Tags:

dataframe

r

I have a dataframe that looks like this:

df <- data.frame("CB_1.1"=c(0,5,6,2), "CB_1.16"=c(1,5,3,6), "HC_2.11"=c(3,3,4,5), "HC_1.12"=c(2,3,4,5), "HC_1.13"=c(1,0,0,5))

> df
  CB_1.1 CB_1.16 HC_2.11 HC_1.12 HC_1.13
1      0       1       3       2       1
2      5       5       3       3       0
3      6       3       4       4       0
4      2       6       5       5       5

I would like to take the mean of rows that share substring of the column name, before the ".". Resulting in a dataframe like this:

  CB_1 HC_2 HC_1
1  0.5    3  1.5
2  5.0    3  1.5
3  4.5    4  2.0
4  4.0    5  5.0

You'll notice that the column HC_2.11 values remain the same, because no other column has HC_2 in this dataframe.

Any help would be appreciated!

like image 251
Workhorse Avatar asked Mar 02 '23 16:03

Workhorse


2 Answers

1) apply/tapply For each row use tapply on it using an INDEX of the name prefixes and a function mean. Transpose the result. No packages are used.

prefix <- sub("\\..*", "", names(df))
t(apply(df, 1, tapply, prefix, mean))

giving this matrix (wrap it in data.frame(...) if you need a data frame result):

     CB_1 HC_1 HC_2
[1,]  0.5  1.5    3
[2,]  5.0  1.5    3
[3,]  4.5  2.0    4
[4,]  4.0  5.0    5

2) lm Run the indicated regression. The +0 in the formula means don't add on an intercept. The transpose of the coefficients will be the required matrix, m. In the next line make the names nicer. prefix is from (1). No packages are used.

m <- t(coef(lm(t(df) ~ prefix + 0)))
colnames(m) <- sub("prefix", "", colnames(m))
m

giving this matrix

     CB_1 HC_1 HC_2
[1,]  0.5  1.5    3
[2,]  5.0  1.5    3
[3,]  4.5  2.0    4
[4,]  4.0  5.0    5

This follows from the facts that (1) the model matrix X contains only ones and zeros and (2) distinct columns of it are orthogonal. The model matrix is shown here:

X <- model.matrix(~ prefix + 0) # model matrix
X

giving:

  prefixCB_1 prefixHC_1 prefixHC_2
1          1          0          0
2          1          0          0
3          0          0          1
4          0          1          0
5          0          1          0
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$prefix
[1] "contr.treatment"

Because the columns of the model matrix X are orthogonal the coefficient corresponding to any column for a particular row, y, of df (column of t(df)) is just sum(x * y) / sum(x * x) and since x is a 0/1 vector that equals the mean of the values of y corresponding to the 1's in x.

3) stack/tapply Convert to long form inserting an id column at the same time. Then use tapply to convert back to wide form tapply-ing mean. No packages are used.

long <- transform(stack(df), ind = sub("\\..*", "", ind), id = c(row(df)))    
with(long, tapply(values, long[c("id", "ind")], mean))

giving this table. Wrap it in as.data.frame.matrix if you want a data.frame.

   ind
id  CB_1 HC_1 HC_2
  1  0.5  1.5    3
  2  5.0  1.5    3
  3  4.5  2.0    4
  4  4.0  5.0    5
like image 137
G. Grothendieck Avatar answered Mar 05 '23 14:03

G. Grothendieck


  • Here is a base R solution using rowMeans + split.default, i.e.,
dfout <- as.data.frame(Map(rowMeans, split.default(df,factor(s <- gsub("\\..*$","",names(df)), levels = unique(s)))))

such that

> dfout
  CB_1 HC_2 HC_1
1  0.5    3  1.5
2  5.0    3  1.5
3  4.5    4  2.0
4  4.0    5  5.0
  • If you do not mind the order of column names, you can use the shorter code below
dfout <- as.data.frame(Map(rowMeans,split.default(df,gsub("\\..*$","",names(df)))))

such that

> dfout
  CB_1 HC_1 HC_2
1  0.5  1.5    3
2  5.0  1.5    3
3  4.5  2.0    4
4  4.0  5.0    5
like image 41
ThomasIsCoding Avatar answered Mar 05 '23 14:03

ThomasIsCoding