
Ifelse behavior within data.table (R)

Tags: r, data.table

I have a data.table full of consumer products. I've classified each product as 'low', 'high', or 'unknown' quality. The data are time series, and I'm interested in smoothing out some seasonality in the data. If a product's raw classification (the classification churned out by the algorithm I used to determine quality) is 'low' quality in period X, but its raw classification was 'high' quality in period X-1, I'm reclassifying that product as 'high' quality for period X. This is done separately within each product group.

To accomplish this, I've got something like the following:

library(data.table)

# lag takes a column and lags it by one period, padding with NA.
# Indexing c(NA, var) by seq_along(var) keeps the result the same
# length and type as the input, even for a single-row group.
lag <- function(var) {
    c(NA, var)[seq_along(var)]
}

set.seed(120)

foo <- data.table(group = c('A', rep(c('B', 'C', 'D'), 5)),
                  period = c(1:16),
                  quality = c('unknown', sample(c('high', 'low', 'unknown'), 15, replace = TRUE)))

foo[, quality_lag := lag(quality), by = group]

foo[, quality_1 := ifelse(quality == 'low' & quality_lag == 'high',
                          'high',
                          quality)]

Taking a look at foo:

    group period quality quality_lag quality_1
 1:     A      1 unknown          NA   unknown
 2:     B      2     low          NA        NA
 3:     C      3    high          NA      high
 4:     D      4     low          NA        NA
 5:     B      5 unknown         low   unknown
 6:     C      6    high        high      high
 7:     D      7     low         low       low
 8:     B      8 unknown     unknown   unknown
 9:     C      9    high        high      high
10:     D     10 unknown         low   unknown
11:     B     11 unknown     unknown   unknown
12:     C     12     low        high      high
13:     D     13 unknown     unknown   unknown
14:     B     14    high     unknown      high
15:     C     15    high         low      high
16:     D     16 unknown     unknown   unknown

So, quality_1 is mostly what I want. If period X is 'low' and period X-1 is 'high', we see the reclassification to 'high' occurs and everything is left mostly intact from quality. However, when quality_lag is NA, 'low' gets reclassified to NA in quality_1. This is not an issue with 'high' or 'unknown'.

That is, the first four rows of foo should look like this:

   group period quality quality_lag quality_1
 1:     A      1 unknown          NA   unknown
 2:     B      2     low          NA       low
 3:     C      3    high          NA      high
 4:     D      4     low          NA       low

Any thoughts on what is causing this?

thagzone asked Oct 19 '22 18:10



1 Answer

For starters, the development version on GitHub already has an efficient lag function called shift, which can be used as either a lag or a lead (and has some additional functionality too, see ?shift).

Also take a look at the other new functions that are now present in v >= 1.9.5.

So under v >= 1.9.5 we could simply do:

foo[, quality_lag := shift(quality), by = group]

Even under v < 1.9.5, though, you could make use of .N instead of writing this function, as follows:

# NA_character_ keeps the result character even for a single-row group
foo[, quality_lag2 := c(NA_character_, quality[-.N]), by = group]
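As a rough check that the two approaches agree, here is a minimal sketch on a small toy table. The table `dt` and the column names `lag_shift` and `lag_manual` are made up for illustration and are not part of the question's data; it assumes a data.table version that ships shift.

```r
library(data.table)

# Toy data (hypothetical): two groups with three periods each
dt <- data.table(group   = rep(c("B", "C"), each = 3),
                 quality = c("low", "unknown", "high", "high", "high", "low"))

# shift() lags by one position by default and pads with NA;
# by = group restarts the lag at the top of every group
dt[, lag_shift := shift(quality), by = group]

# Manual equivalent: drop the last element of each group (.N is the group size)
# and prepend a character NA
dt[, lag_manual := c(NA_character_, quality[-.N]), by = group]

print(dt)
```

Both columns should come out the same, with an NA in the first row of each group.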

Regarding your second question, I would recommend avoiding ifelse altogether, for the many reasons specified here.

One possible alternative is simple indexing, as in:

foo[, quality_1 := quality][quality == 'low' & quality_lag == 'high', quality_1 := "high"]

This solution has a bit of overhead from calling [.data.table twice, but it will still be much more efficient and safer than the ifelse solution.
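As for where the NA comes from: in R's three-valued logic, TRUE & NA is NA while FALSE & NA is FALSE, and ifelse returns NA wherever its test is NA. That is why only the 'low' rows with a missing lag are affected. A logical NA in the i argument of a data.table, by contrast, is treated as FALSE, so those rows are simply left alone. The sketch below illustrates both on a toy table; `dt`, `via_ifelse` and `via_index` are hypothetical names chosen for the example.

```r
library(data.table)

# Three-valued logic: NA only survives when the other operand is TRUE
TRUE & NA    # NA    (a 'low' row whose lag is missing)
FALSE & NA   # FALSE (a 'high' or 'unknown' row whose lag is missing)

# Toy data (hypothetical): row 1 mirrors the problematic case
dt <- data.table(quality     = c("low", "high", "low"),
                 quality_lag = c(NA, NA, "high"))

# ifelse() propagates the NA from the test into the result
dt[, via_ifelse := ifelse(quality == "low" & quality_lag == "high",
                          "high", quality)]

# Copy the column, then overwrite only the rows where the condition is TRUE;
# rows where the condition is NA are skipped, so row 1 keeps 'low'
dt[, via_index := quality]
dt[quality == "low" & quality_lag == "high", via_index := "high"]

print(dt)
```

If ifelse has to stay, wrapping the test so that NA counts as FALSE (for example with `%in% TRUE`) should give the same result, though the indexing form avoids the issue entirely.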

David Arenburg answered Oct 26 '22 23:10
