Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Create a series of new columns with (d)plyr

Say for example I have a data frame with a column a and I want to create columns a^i for several values of i.

> dat <- data.frame(a=1:5)
> dat
    a
1   1
2   2
3   3
4   4
5   5

As an example, the output I want for i=2:5:

  a power_2 power_3 power_4 power_5
1 1       1       1       1       1
2 2       4       8      16      32
3 3       9      27      81     243
4 4      16      64     256    1024
5 5      25     125     625    3125

Currently I get this output with data.table as follows:

DT <- data.table(dat)
exponents <- 2:5
DT[, paste0("power_",exponents):=lapply(exponents, function(p) a^p)]

How to do with plyr/dplyr ? Of course I could do as below by typing power_i=a^i for each i but this is not what I want.

mutate(dat, power_2=a^2, power_3=a^3, ...)

Conclusion after answers

Several answers have been proposed, and have been compared by @docendo discimus. I'm just adding the comparison with data.table.

library(data.table)
library(dplyr)
set.seed(2015)
dat <- data.frame(a = sample(1000))
i <- 2:5
n <- c(names(dat), paste0("power_", i))
DT <-  data.table(dat)

library(microbenchmark)

microbenchmark(
  data.table = DT[, paste0("power_",i):=lapply(i, function(k) a^k)],
  Henrik = dat %>% do(data.frame(., outer(.$a, i, `^`))) %>% setNames(n),
  dd.do = dat %>% do(data.frame(., sapply(i, function(x) .$a^x))) %>% setNames(n),
  dd.bc = dat %>% bind_cols(as.data.frame(lapply(i, function(x) .$a^x))) %>% setNames(n),
  times = 30,
  unit = "relative"
)
Unit: relative
       expr       min        lq      mean    median        uq       max neval cld
 data.table  1.022945  1.039674  1.108558  1.026319  1.083644  2.370180    30  a 
     Henrik  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000    30  a 
      dd.do  1.149195  1.160735  1.167672  1.158141  1.150280  1.268279    30  a 
      dd.bc 14.350034 13.982658 13.737964 13.632361 13.606221 15.866711    30   b

Updated benchmark with two base solutions, Henrik2 and josh (from his comment), which are fastest:

set.seed(2015)
dat <- data.frame(a = sample(1000))

microbenchmark(
  data.table = DT[, paste0("power_",i):=lapply(i, function(k) a^k)],
  Henrik = dat %>% do(data.frame(., outer(.$a, i, `^`))) %>% setNames(n),
  Henrik2 = cbind(dat, outer(dat$a, setNames(i, paste0("power_", i)),  `^`)),
  dd.do = dat %>% do(data.frame(., sapply(i, function(x) .$a^x))) %>% setNames(n),
  dd.bc = dat %>% bind_cols(as.data.frame(lapply(i, function(x) .$a^x))) %>% setNames(n),
  josh = data.frame(dat, setNames(lapply(2:5, function(X) dat$a^X), paste0("power_", 2:5))),
  times = 30,
  unit = "relative"
)

# Unit: relative
#       expr       min        lq      mean    median        uq       max neval  cld
# data.table  1.991613  2.029778  1.982169  1.990417  1.946677  1.694030    30  bc 
#     Henrik  2.026345  2.017179  1.996419  2.003189  2.030176  1.733583    30  bc 
#    Henrik2  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000    30 a   
#      dd.do  2.356886  2.375713  2.322452  2.348053  2.304826  2.101494    30   c 
#      dd.bc 37.445491 36.081298 34.791638 34.783854 34.787655 27.832116    30    d
#       josh  1.725750  1.699887  1.641290  1.625331  1.637823  1.330598    30  b
like image 560
Stéphane Laurent Avatar asked Dec 12 '22 01:12

Stéphane Laurent


1 Answers

One possibility is to use outer in do, and then set the names with setNames

i <- 2:5
dat %>%
  do(data.frame(., outer(.$a, i, `^`))) %>%
  setNames(., c("a", paste0("power_", i)))

#   a power_2 power_3 power_4 power_5
# 1 1       1       1       1       1
# 2 2       4       8      16      32
# 3 3       9      27      81     243
# 4 4      16      64     256    1024
# 5 5      25     125     625    3125

If you name the 'power vector' "i" first, you can call cbind instead of do and data.frame, and I see no immediate need for dplyr functions in this particular case.

cbind(dat, outer(dat$a, setNames(i, paste0("power_", i)),  `^`))
#   a power_2 power_3 power_4 power_5
# 1 1       1       1       1       1
# 2 2       4       8      16      32
# 3 3       9      27      81     243
# 4 4      16      64     256    1024
# 5 5      25     125     625    3125

The base, non-do code is faster for your larger sample data. I also added the base solution by @Josh O'Brien.

set.seed(2015)
dat <- data.frame(a = sample(1000))

microbenchmark(
  data.table = DT[, paste0("power_",i):=lapply(i, function(k) a^k)],
  Henrik = dat %>% do(data.frame(., outer(.$a, i, `^`))) %>% setNames(n),
  Henrik2 = cbind(dat, outer(dat$a, setNames(i, paste0("power_", i)),  `^`)),
  dd.do = dat %>% do(data.frame(., sapply(i, function(x) .$a^x))) %>% setNames(n),
  dd.bc = dat %>% bind_cols(as.data.frame(lapply(i, function(x) .$a^x))) %>% setNames(n),
  josh = data.frame(dat, setNames(lapply(2:5, function(X) dat$a^X), paste0("power_", 2:5))),
  times = 30,
  unit = "relative"
)

# Unit: relative
#       expr       min        lq      mean    median        uq       max neval  cld
# data.table  1.991613  2.029778  1.982169  1.990417  1.946677  1.694030    30  bc 
#     Henrik  2.026345  2.017179  1.996419  2.003189  2.030176  1.733583    30  bc 
#    Henrik2  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000    30 a   
#      dd.do  2.356886  2.375713  2.322452  2.348053  2.304826  2.101494    30   c 
#      dd.bc 37.445491 36.081298 34.791638 34.783854 34.787655 27.832116    30    d
#       josh  1.725750  1.699887  1.641290  1.625331  1.637823  1.330598    30  b
like image 130
Henrik Avatar answered Jan 11 '23 23:01

Henrik