Row Means based on Column Substring

Question

I have a dataframe that looks like this:

df <- data.frame("CB_1.1"=c(0,5,6,2), "CB_1.16"=c(1,5,3,6), "HC_2.11"=c(3,3,4,5), "HC_1.12"=c(2,3,4,5), "HC_1.13"=c(1,0,0,5))

> df
  CB_1.1 CB_1.16 HC_2.11 HC_1.12 HC_1.13
1      0       1       3       2       1
2      5       5       3       3       0
3      6       3       4       4       0
4      2       6       5       5       5

I would like to take the mean of rows that share substring of the column name, before the ".". Resulting in a dataframe like this:

  CB_1 HC_2 HC_1
1  0.5    3  1.5
2  5.0    3  1.5
3  4.5    4  2.0
4  4.0    5  5.0

You'll notice that the column HC_2.11 values remain the same, because no other column has HC_2 in this dataframe.

Any help would be appreciated!

G. Grothendieck · Accepted Answer

1) apply/tapply For each row use tapply on it using an INDEX of the name prefixes and a function mean. Transpose the result. No packages are used.

prefix <- sub("\..*", "", names(df))
t(apply(df, 1, tapply, prefix, mean))

giving this matrix (wrap it in data.frame(...) if you need a data frame result):

     CB_1 HC_1 HC_2
[1,]  0.5  1.5    3
[2,]  5.0  1.5    3
[3,]  4.5  2.0    4
[4,]  4.0  5.0    5

2) lm Run the indicated regression. The +0 in the formula means don't add on an intercept. The transpose of the coefficients will be the required matrix, m. In the next line make the names nicer. prefix is from (1). No packages are used.

m <- t(coef(lm(t(df) ~ prefix + 0)))
colnames(m) <- sub("prefix", "", colnames(m))
m

giving this matrix

     CB_1 HC_1 HC_2
[1,]  0.5  1.5    3
[2,]  5.0  1.5    3
[3,]  4.5  2.0    4
[4,]  4.0  5.0    5

This follows from the facts that (1) the model matrix X contains only ones and zeros and (2) distinct columns of it are orthogonal. The model matrix is shown here:

X <- model.matrix(~ prefix + 0) # model matrix
X

giving:

  prefixCB_1 prefixHC_1 prefixHC_2
1          1          0          0
2          1          0          0
3          0          0          1
4          0          1          0
5          0          1          0
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$prefix
[1] "contr.treatment"

Because the columns of the model matrix X are orthogonal the coefficient corresponding to any column for a particular row, y, of df (column of t(df)) is just sum(x * y) / sum(x * x) and since x is a 0/1 vector that equals the mean of the values of y corresponding to the 1's in x.

3) stack/tapply Convert to long form inserting an id column at the same time. Then use tapply to convert back to wide form tapply-ing mean. No packages are used.

long <- transform(stack(df), ind = sub("\..*", "", ind), id = c(row(df)))    
with(long, tapply(values, long[c("id", "ind")], mean))

giving this table. Wrap it in as.data.frame.matrix if you want a data.frame.

   ind
id  CB_1 HC_1 HC_2
  1  0.5  1.5    3
  2  5.0  1.5    3
  3  4.5  2.0    4
  4  4.0  5.0    5

ThomasIsCoding · Answer

Here is a base R solution using rowMeans + split.default, i.e.,

dfout <- as.data.frame(Map(rowMeans, split.default(df,factor(s <- gsub("\..*$","",names(df)), levels = unique(s)))))

such that

> dfout
  CB_1 HC_2 HC_1
1  0.5    3  1.5
2  5.0    3  1.5
3  4.5    4  2.0
4  4.0    5  5.0

If you do not mind the order of column names, you can use the shorter code below

dfout <- as.data.frame(Map(rowMeans,split.default(df,gsub("\..*$","",names(df)))))

such that

> dfout
  CB_1 HC_1 HC_2
1  0.5  1.5    3
2  5.0  1.5    3
3  4.5  2.0    4
4  4.0  5.0    5

Row Means based on Column Substring

Tags:

dataframe

r

Workhorse

2 Answers

G. Grothendieck

ThomasIsCoding

Recent Activity

Donate For Us

Row Means based on Column Substring

Tags:

dataframe

r

Workhorse

2 Answers

G. Grothendieck

ThomasIsCoding

Related questions

Recent Activity

Donate For Us