Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Mutate all columns matching a pattern each time based on the previous columns

Tags:

r

dplyr

How can I mutate all columns containing a pattern (mutate_at I guess) using each time the previous column using dplyr? --> Here for example all column continaing foo in their name should be mutated using the column just before (i.e., a for column fooa, b for foob and so on).


set.seed(13)
dfrows = 5
df = data.frame(a = rnorm(dfrows),
                fooa = runif(dfrows),
                b = rnorm(dfrows, mean=50, sd=5),
                foob = runif(dfrows, min=0, max=5),
                c = rnorm(dfrows, mean=100, sd=10),
                fooc = runif(dfrows, min=0, max=10))
df
#            a      fooa        b      foob         c     fooc
# 1  0.5543269 0.6611216 48.26791 3.0999527  98.06053 6.035485
# 2 -0.2802719 0.8783709 51.15647 0.1586242 113.96432 2.299504
# 3  1.7751634 0.8905590 52.34582 2.3070636 101.00663 9.668332
# 4  0.1873201 0.5662805 50.58978 1.6501046  98.85561 6.045547
# 5  1.1425261 0.5935473 50.35224 3.1676038 107.02225 6.396047

library(dplyr)
df %>% mutate(fooa = fooa/100 * a,
              foob = foob/100 * b,
              fooc = fooc/100 * c)
#            a         fooa        b       foob         c     fooc
# 1  0.5543269  0.003664775 48.26791 1.49628246  98.06053 5.918428
# 2 -0.2802719 -0.002461827 51.15647 0.08114656 113.96432 2.620614
# 3  1.7751634  0.015808878 52.34582 1.20765132 101.00663 9.765657
# 4  0.1873201  0.001060757 50.58978 0.83478430  98.85561 5.976363
# 5  1.1425261  0.006781434 50.35224 1.59495949 107.02225 6.845194

# Equivalently, in base R:
for (i in c(2, 4, 6)) {
  df[,i] = df[,i]/100 * df[, i-1]
}

So I am looking for something like this I guess:

# What should <PREVIOUS_COLUMN> be?
df %>% mutate_at(vars(contains('foo')), funs(./100 * <PREVIOUS_COLUMN>)) 

# OR, even better (more generic but in my case it will always be the previous column):
df %>% mutate_at(vars(contains('foo')), funs(./100 * <COLUMN_NAME_WITH_'foo'_PATTERN_REMOVED>)) 

EDIT: I should have mentioned that the original data.frame could contain more columns, possibly with another pattern than X then fooX, so that the ideal solution should be able to localize them properly (but I'll leave it as such as all answers provide nice solutions and features).

A better example would have been:

set.seed(13)
dfrows = 5
df = data.frame(a = rnorm(dfrows),
                fooa = runif(dfrows),
                b = rnorm(dfrows, mean=50, sd=5),
                foob = runif(dfrows, min=0, max=5),
                bla = 5,
                c = rnorm(dfrows, mean=100, sd=10),
                fooc = runif(dfrows, min=0, max=10),
                blo = 8)
df
#            a      fooa        b      foob bla         c     fooc blo
# 1  0.5543269 0.6611216 48.26791 3.0999527   5  98.06053 6.035485   8
# 2 -0.2802719 0.8783709 51.15647 0.1586242   5 113.96432 2.299504   8
# 3  1.7751634 0.8905590 52.34582 2.3070636   5 101.00663 9.668332   8
# 4  0.1873201 0.5662805 50.58978 1.6501046   5  98.85561 6.045547   8
# 5  1.1425261 0.5935473 50.35224 3.1676038   5 107.02225 6.396047   8
like image 464
ztl Avatar asked Jan 27 '21 09:01

ztl


3 Answers

This is actually vectorised operation so you can do this without any loops.

Using vector recycling :

df[, c(FALSE, TRUE)] <- df[, c(FALSE, TRUE)]/100 * df[, c(TRUE, FALSE)]
df

#           a         fooa        b       foob         c     fooc
#1  0.5543269  0.003664775 48.26791 1.49628246  98.06053 5.918428
#2 -0.2802719 -0.002461827 51.15647 0.08114656 113.96432 2.620614
#3  1.7751634  0.015808878 52.34582 1.20765132 101.00663 9.765657
#4  0.1873201  0.001060757 50.58978 0.83478430  98.85561 5.976363
#5  1.1425261  0.006781434 50.35224 1.59495949 107.02225 6.845194

If you want to perform this based on names of the column.

cols <- grep('foo', names(df))
df[cols] <- df[cols]/100 * df[-cols]
like image 60
Ronak Shah Avatar answered Nov 17 '22 00:11

Ronak Shah


Here is another approach using across() and cur_column(). I personally would not recommend to make calculations based on the position of columns, and would instead recommend to work with the columns names, since this seems safer.

In the example below we loop over the columns a,b and c with across and access the values of each corresponding foo column by using get() and cur_column.

set.seed(13)
dfrows = 5
df = data.frame(a = rnorm(dfrows),
                fooa = runif(dfrows),
                b = rnorm(dfrows, mean=50, sd=5),
                foob = runif(dfrows, min=0, max=5),
                c = rnorm(dfrows, mean=100, sd=10),
                fooc = runif(dfrows, min=0, max=10))

library(dplyr)

df %>% 
  mutate(across(matches("^[a-z]$"),
                ~ get(paste0("foo", cur_column())) / 100 * .x,
                .names = "foo{col}"))
#>            a         fooa        b       foob         c     fooc
#> 1  0.5543269  0.003664775 48.26791 1.49628246  98.06053 5.918428
#> 2 -0.2802719 -0.002461827 51.15647 0.08114656 113.96432 2.620614
#> 3  1.7751634  0.015808878 52.34582 1.20765132 101.00663 9.765657
#> 4  0.1873201  0.001060757 50.58978 0.83478430  98.85561 5.976363
#> 5  1.1425261  0.006781434 50.35224 1.59495949 107.02225 6.845194

Created on 2021-01-27 by the reprex package (v0.3.0)

like image 41
TimTeaFan Avatar answered Nov 16 '22 23:11

TimTeaFan


One option could be:

df %>%
 mutate(across(starts_with("foo"))/100 * across(!matches("foo")))

           a         fooa        b       foob         c     fooc
1  0.5543269  0.003664775 48.26791 1.49628246  98.06053 5.918428
2 -0.2802719 -0.002461827 51.15647 0.08114656 113.96432 2.620614
3  1.7751634  0.015808878 52.34582 1.20765132 101.00663 9.765657
4  0.1873201  0.001060757 50.58978 0.83478430  98.85561 5.976363
5  1.1425261  0.006781434 50.35224 1.59495949 107.02225 6.845194
like image 29
tmfmnk Avatar answered Nov 17 '22 00:11

tmfmnk