Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to use ifelse row-wise in R data.table?

Tags:

r

data.table

I'd like to create a new column in an R data.table that is based on an ifelse() comparison of different columns. However, I'd like the ifelse statement to be applied row-wise. I've tried using the group by functionality of data.table, but it seems to apply the test condition of the ifelse row-wise but evaluate yes condition across all values in the column instead of using the by condition to do it row-wise. Below is an example and some of the solutions that I've tried.

I have an R data.table like this:

> set.seed(45)
> DT <- data.table(date = c(rep("2018-01-01", 3), rep("2018-01-02", 3), rep("2018-01-03", 3)), 
+                  id = rep(letters[1:3], 3), 
+                  v1 = sample(x = -20:20, size = 9), 
+                  v2 = sample(x = -20:20, size = 9))
> str(DT)
Classes ‘data.table’ and 'data.frame':  9 obs. of  4 variables:
 $ date: chr  "2018-01-01" "2018-01-01" "2018-01-01" "2018-01-02" ...
 $ id  : chr  "a" "b" "c" "a" ...
 $ v1  : int  5 -8 -11 -6 -7 -10 -13 -2 -14
 $ v2  : int  -20 -6 14 -9 -3 -5 19 12 -16
 - attr(*, ".internal.selfref")=<externalptr> 
> DT
         date id  v1  v2
1: 2018-01-01  a   5 -20
2: 2018-01-01  b  -8  -6
3: 2018-01-01  c -11  14
4: 2018-01-02  a  -6  -9
5: 2018-01-02  b  -7  -3
6: 2018-01-02  c -10  -5
7: 2018-01-03  a -13  19
8: 2018-01-03  b  -2  12
9: 2018-01-03  c -14 -16

I want the following output:

> DT_out
         date id  v1  v2  c
1: 2018-01-01  a   5 -20  0
2: 2018-01-01  b  -8  -6  0
3: 2018-01-01  c -11  14 11
4: 2018-01-02  a  -6  -9  0
5: 2018-01-02  b  -7  -3  0
6: 2018-01-02  c -10  -5  0
7: 2018-01-03  a -13  19 13
8: 2018-01-03  b  -2  12  2
9: 2018-01-03  c -14 -16  0 

Solutions I've tried:

Attempt #1) No error, but evaluates the min across all values in both v1 and v2. This behavior is to be expected; though, it's odd to me that it evaluates the test condition on a row-wise basis even without a key set or by stated:

> DT[, c := ifelse(v1 < 0 & v2 > 0, min(-v1, v2), 0)]
> DT
         date id  v1  v2   c
1: 2018-01-01  a   5 -20   0
2: 2018-01-01  b  -8  -6   0
3: 2018-01-01  c -11  14 -20
4: 2018-01-02  a  -6  -9   0
5: 2018-01-02  b  -7  -3   0
6: 2018-01-02  c -10  -5   0
7: 2018-01-03  a -13  19 -20
8: 2018-01-03  b  -2  12 -20
9: 2018-01-03  c -14 -16   0

Attempt #2) When I set the key and use the by condition, nothing changes but I receive an error message.

> setkey(DT, date, id)
> DT[, c := ifelse(v1 < 0 & v2 > 0, min(-v1, v2), 0), by = list(date, id)]
Error in `[.data.table`(DT, , `:=`(c, ifelse(v1 < 0 & v2 > 0, min(-v1,  : 
  Type of RHS ('integer') must match LHS ('double'). To check and coerce would impact performance too much for the fastest cases. Either change the type of the target column, or coerce the RHS of := yourself (e.g. by using 1L instead of 1)
> DT
         date id  v1  v2   c
1: 2018-01-01  a   5 -20   0
2: 2018-01-01  b  -8  -6   0
3: 2018-01-01  c -11  14 -20
4: 2018-01-02  a  -6  -9   0
5: 2018-01-02  b  -7  -3   0
6: 2018-01-02  c -10  -5   0
7: 2018-01-03  a -13  19 -20
8: 2018-01-03  b  -2  12 -20
9: 2018-01-03  c -14 -16   0

Since the combination of date and id is unique for each row, it's all the more difficult for me to understand why this isn't being evaluated for each group, which is, in this case, each row.

Perhaps I need to use .SDcols = .(date, id) and .SD in the ifelse, but I don't know how to use the .SD in the ifelse.

like image 743
Ursus Frost Avatar asked Oct 13 '18 20:10

Ursus Frost


1 Answers

You need to use pmin instead of min:

DT[, c := ifelse(v1 < 0 & v2 > 0, pmin(-v1, v2), 0)]

> DT
         date id  v1  v2  c
1: 2018-01-01  a   5 -20  0
2: 2018-01-01  b  -8  -6  0
3: 2018-01-01  c -11  14 11
4: 2018-01-02  a  -6  -9  0
5: 2018-01-02  b  -7  -3  0
6: 2018-01-02  c -10  -5  0
7: 2018-01-03  a -13  19 13
8: 2018-01-03  b  -2  12  2
9: 2018-01-03  c -14 -16  0

# see also:
?pmin

pmax*() and pmin*() take one or more vectors as arguments, recycle them to common length and return a single vector giving the ‘parallel’ maxima (or minima) of the argument vectors.

[added later]

Your original code works as well, if you first change the column types:

  DT[, v1:= as.numeric(v1)]   # was integer, converting to 'double'
  DT[, v2:= as.numeric(v2)]   # ---,,---
  DT[, c := ifelse(v1 < 0 & v2 > 0, min(-v1, v2), 0), by = list(date, id)]

As far as I understand, the data.table philosophy is not to let R change the column types "implicitly" but that the type would stay until changed explicitly.

The manual says:

Unlike <- for data.frame, the (potentially large) LHS is not coerced to match the type of the (often small) RHS. Instead the RHS is coerced to match the type of the LHS, if necessary. Where this involves double precision values being coerced to an integer column, a warning is given (whether or not fractional data is truncated). The motivation for this is efficiency. It is best to get the column types correct up front and stick to them. Changing a column type is possible but deliberately harder: provide a whole column as the RHS. This RHS is then plonked into that column slot and we call this plonk syntax, or replace column syntax if you prefer. By needing to construct a full length vector of a new type, you as the user are more aware of what is happening, and it’s clearer to readers of your code that you really do intend to change the column type.

So far so good. But of course, the original error message is confusing.

 # To check and coerce would impact performance too much for the fastest cases. 

"For the fastest cases?". This must be one of the fastest cases as the data set is microscopically tiny, and I bet nobody would notice the impact of performance in this case if data.table would allow implicit type conversions. So the primary motivation for this error message seems to be that the package author wants to enforce what he believes to be good practice.

This will also work (without type conversions):

 DT[, c := ifelse(v1 < 0 & v2 > 0, as.numeric(min(-v1, v2)), 0), by = list(date, id)]  # 1

or alternatively:

 DT[, c := ifelse(v1 < 0 & v2 > 0, min(-v1, v2), 0L), by = list(date, id)] # 2

But you cannot run the last two rows -- #1 and #2 -- one after another, the c column has to be deleted first. DT$c will be numeric in the first case and integer in second.

a few additional experiments

DT[, c:= NULL] 
DT[, c := ifelse(v1 < 0, v1, 0), by = list(date, id)] 
# error but DT$c col created with first element as NA
# the condition was FALSE for the first element, so numeric 0 became the first element of c
# ... but the next element would be integer, hence the error
DT$c # [1]  0 NA NA NA NA NA NA NA NA
DT[, c:= NULL] 
DT[, c := ifelse(v1 > 0, v1, 0), by = list(date, id)]
# error; DT$c column is integer, with 5 as first element and the rest as NA 
DT$c # [1]  5 NA NA NA NA NA NA NA NA
DT[, c:= NULL] 
DT[, c := ifelse(v1 < 0, as.numeric(v1), 0), by = list(date, id)] 
# works without error but results in numeric DT$c
is.numeric(DT$c) # TRUE
DT[, c := ifelse(v1 < 0, v1, 0L), by = list(date, id)]
# type error, DT$c was numeric and we are trying to add an integer column
DT[, c:= NULL] # deleting the c column again
DT[, c := ifelse(v1 < 0, v1, 0L), by = list(date, id)]
# no error now
is.integer(DT$c) # TRUE
like image 85
lebatsnok Avatar answered Sep 26 '22 21:09

lebatsnok