
Compute means of column values based on conditional substrings in the column names

Tags: r, mean

I have a dataframe with hundreds of columns. For example purposes, here is a toy dataframe.

TPT_A_2 | TPT_B_2 | TPT_C_2 | TPT_A_4 | TPT_B_4 | TPT_C_4 | TPT_A_6 | TPT_B_6 | TPT_C_6 | 
 100        100       100       200       200      200       400       400        400   

I want to compute the mean of the variables whose names share the same prefix (TPT_A, TPT_B, ...) and end in 2 or 4. So I would get something like:

TPT_A_mean | TPT_B_mean | TPT_C_mean | TPT_A_6 | TPT_B_6 | TPT_C_6 | 
  150           150          150         400      400        400  

The data can be reproduced with:

row1 <- c("TPT_A_2", "TPT_B_2", "TPT_C_2","TPT_A_4", "TPT_B_4", "TPT_C_4", "TPT_A_6", "TPT_B_6", "TPT_C_6")
row2 <- c(100, 100, 100, 200, 200, 200, 400, 40, 400)   
data <- as.data.frame(rbind(row1, row2))
colnames(data) <- as.character(data[1,])
data <- data[-1,]
asked Dec 31 '22 11:12 by RoyBatty


1 Answer

First, your method for generating the frame is an anti-pattern: binding the column names and the values together with rbind() converts your numbers to strings.

str(data)
# 'data.frame': 1 obs. of  9 variables:
#  $ TPT_A_2: chr "100"
#  $ TPT_B_2: chr "100"
#  $ TPT_C_2: chr "100"
#  $ TPT_A_4: chr "200"
#  $ TPT_B_4: chr "200"
#  $ TPT_C_4: chr "200"
#  $ TPT_A_6: chr "400"
#  $ TPT_B_6: chr "40"
#  $ TPT_C_6: chr "400"
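
The coercion comes from rbind() itself: a matrix can hold only one type, so binding a character vector to a numeric vector yields a character matrix before as.data.frame() ever sees it. A minimal sketch of that behaviour, using shortened, made-up vectors (`nms`, `vals`, and `m` are illustration names, not part of the question):

```r
# Hypothetical, shortened stand-ins for row1 and row2 from the question
nms  <- c("TPT_A_2", "TPT_B_2")
vals <- c(100, 200)

# rbind() builds a matrix; mixing character and numeric forces character
m <- rbind(nms, vals)

typeof(m)     # "character"
m["vals", ]   # "100" "200" (strings, no longer numbers)
```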

Instead, build the frame from a named list so that the values stay numeric:

row1 <- c("TPT_A_2", "TPT_B_2", "TPT_C_2","TPT_A_4", "TPT_B_4", "TPT_C_4", "TPT_A_6", "TPT_B_6", "TPT_C_6")
row2 <- c(100, 100, 100, 200, 200, 200, 400, 40, 400)   
# name each value by its column name, then convert the named list to a frame
dat <- as.data.frame(setNames(as.list(row2), row1))
str(dat)
# 'data.frame': 1 obs. of  9 variables:
#  $ TPT_A_2: num 100
#  $ TPT_B_2: num 100
#  $ TPT_C_2: num 100
#  $ TPT_A_4: num 200
#  $ TPT_B_4: num 200
#  $ TPT_C_4: num 200
#  $ TPT_A_6: num 400
#  $ TPT_B_6: num 40
#  $ TPT_C_6: num 400
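
If a frame has already been built the question's way, its columns can also be converted after the fact. This is only a sketch: it assumes R 4.0 or later (where as.data.frame() gives character columns rather than factors) and that every column holds numeric text, since anything else would become NA with a warning. The example is shortened to three columns.

```r
# Rebuild a small character frame the same way the question does
row1 <- c("TPT_A_2", "TPT_B_2", "TPT_C_2")
row2 <- c(100, 100, 100)
data <- as.data.frame(rbind(row1, row2))
colnames(data) <- as.character(data[1, ])
data <- data[-1, ]

# Convert every column; assigning into data[] keeps the data.frame structure
data[] <- lapply(data, as.numeric)
str(data)
```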

From here, there are two options.

base R

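# columns ending in _2 or _4 (to be averaged) vs. everything else (kept as-is)
dat2a <- subset(dat, select = grepl("^TPT_[ABC]_[24]$", colnames(dat)))
dat2b <- subset(dat, select = !grepl("^TPT_[ABC]_[24]$", colnames(dat)))
cbind(
  dat2b,
  # group the columns by prefix (TPT_A, TPT_B, ...) and take one mean per group
  lapply(split.default(dat2a, gsub("_[24]$", "", colnames(dat2a))),
         function(z) mean(unlist(z)))
)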
#   TPT_A_6 TPT_B_6 TPT_C_6 TPT_A TPT_B TPT_C
# 1     400      40     400   150   150   150
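
Note that mean(unlist(z)) collapses every value in a group into a single number, which is fine for the one-row toy frame. With more than one row, a per-row mean is probably what is wanted. The following is a sketch of one way to do that with rowMeans(), also adding the `_mean` suffix and column order from the question's desired output. The frame `dat_multi` and its second row are made up for illustration.

```r
# Hypothetical two-row frame; the second row is invented for this example
dat_multi <- data.frame(
  TPT_A_2 = c(100, 10), TPT_B_2 = c(100, 20),
  TPT_A_4 = c(200, 30), TPT_B_4 = c(200, 40),
  TPT_A_6 = c(400, 50), TPT_B_6 = c(40, 60)
)

# TRUE for the columns to average (names ending in _2 or _4)
is_24 <- grepl("_[24]$", colnames(dat_multi))

# Group those columns by prefix, e.g. TPT_A_2 and TPT_A_4 both go under "TPT_A"
groups <- split.default(
  dat_multi[is_24],
  sub("_[24]$", "", colnames(dat_multi)[is_24])
)

# One mean per row within each group, then label the results
means <- lapply(groups, rowMeans)
names(means) <- paste0(names(means), "_mean")

# Averaged columns first, untouched columns after
res <- cbind(as.data.frame(means), dat_multi[!is_24])
res
```

Because the pattern only looks at the suffix, this does not need the prefixes to be listed by hand, which may help with hundreds of columns.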

dplyr

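library(dplyr) # %>%, bind_cols
library(purrr) # imap
dat %>%
  # group columns by prefix; each _6 column keeps its full name and forms a group of one
  split.default(., gsub("_[24]$", "",  colnames(.))) %>%
  imap(., function(x, nm)  {
    if (ncol(x) > 1) {
      # several columns share this prefix: collapse them to a single mean
      setNames(data.frame(mean(unlist(x))), paste0(nm, "_mean"))
    } else x
  }) %>%
  bind_cols()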
#   TPT_A_mean TPT_A_6 TPT_B_mean TPT_B_6 TPT_C_mean TPT_C_6
# 1        150     400        150      40        150     400
answered Apr 07 '23 11:04 by r2evans