Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Speed up this loop to create dummy columns with data.table and set in R [duplicate]

I have a data table and I want to create a new column for each unique day, and then assign a 1 in each row where the day matches the column name

I have done this using a for loop but I was wondering if there was any way to optimise it using data.table and set?

Here is an example

dt <- data.table(Week_Day = c("Monday", "Tuesday", "Wednesday",
                          "Thursday", "Friday", "Saturday", "Sunday"))

Day <- unique(dt$Week_Day)
for (i in 1:length(Day)) {
    if (Day[i] != "Sunday") {
        dt[, Day[i] := ifelse(Week_Day == Day[i], 1, 0)]
    }
}

my table is 298k rows and although it doesn't take long to execute (below), its part of a long script and I have quite a few inefficient loops so I am trying to get the overall run time down.

Time to run:

user  system elapsed
0.99    0.06    1.05

Thanks in advance.

like image 574
MidnightDataGeek Avatar asked Jul 22 '16 12:07

MidnightDataGeek


2 Answers

Here's a different approach that, performs better - on my machine - than the original approach in the question

1) Get unique days except Sunday

Day <- setdiff(dt$Week_Day, "Sunday")

2) Initialize new columns with 0:

dt[, (Day) := 0L]

3) Update with 1s by reference in a loop:

for(x in Day) {
  set(dt, i = which(dt[["Week_Day"]] == x), j = x, value = 1L)
}

Simple performance comparison:

dt1 <- data.table(Week_Day = sample(c("Monday", "Tuesday", "Wednesday",
                              "Thursday", "Friday", "Saturday", "Sunday"), 3e5, TRUE))

dt2 <- copy(dt1)


system.time({
  Day <- setdiff(unique(dt$Week_Day), "Sunday")
  dt1[, (Day) := 0L]
  for(x in Day) {
    set(dt1, i = which(dt1[["Week_Day"]] == x), j = x, value = 1L)
  }
})
#       User      System verstrichen 
#      0.029       0.003       0.032 

system.time({
  Day <- unique(dt$Week_Day)
  for (i in 1:length(Day)) {
    if (Day[i] != "Sunday") {
      dt2[, Day[i] := ifelse(Week_Day == Day[i], 1L, 0L)]
    }
  }
})

#       User      System verstrichen 
#      0.138       0.070       0.210 


all.equal(dt1, dt2)
#[1] TRUE
like image 168
talat Avatar answered Nov 18 '22 17:11

talat


Here is one attempt at a speed up:

Day <- unique(dt$Week_Day)
setkey(dt, Week_Day)

# create columns of 0s
dt[, (Day) := 0L]

for (i in seq_along(head(Day, -1))) {
     dt[Day[i], Day[i] := 1L]
}

This implements a couple of the data.table speed ups including binary search in the second chain and the elimination of ifelse with replacement by reference.

like image 38
lmo Avatar answered Nov 18 '22 19:11

lmo