I have a data.table full of some consumer products. I've classified the products as 'low', 'high', or 'unknown' quality. The data are time series, and I'm interested in smoothing out some seasonality in the data. If a product's raw classification (the classification churned out by the algorithm I used to determine quality) is 'low' quality in period X, but its raw classification was 'high' quality in period X-1, I'm reclassifying that product as 'high' quality for period X. This process is done within some sort of product group distinction.
To accomplish this, I've got something like the following:
library(data.table)

# lag takes a column and lags it by one period,
# padding with NA
lag <- function(var) {
  # head(var, -1) drops the last element and, unlike var[1:(length(var) - 1)],
  # also behaves correctly when var has length 1
  lagged <- c(NA, head(var, -1))
  return(lagged)
}

set.seed(120)
foo <- data.table(group   = c('A', rep(c('B', 'C', 'D'), 5)),
                  period  = 1:16,
                  quality = c('unknown', sample(c('high', 'low', 'unknown'), 15, replace = TRUE)))

foo[, quality_lag := lag(quality), by = group]
foo[, quality_1 := ifelse(quality == 'low' & quality_lag == 'high',
                          'high',
                          quality)]
Taking a look at foo:

    group period quality quality_lag quality_1
 1:     A      1 unknown          NA   unknown
 2:     B      2     low          NA        NA
 3:     C      3    high          NA      high
 4:     D      4     low          NA        NA
 5:     B      5 unknown         low   unknown
 6:     C      6    high        high      high
 7:     D      7     low         low       low
 8:     B      8 unknown     unknown   unknown
 9:     C      9    high        high      high
10:     D     10 unknown         low   unknown
11:     B     11 unknown     unknown   unknown
12:     C     12     low        high      high
13:     D     13 unknown     unknown   unknown
14:     B     14    high     unknown      high
15:     C     15    high         low      high
16:     D     16 unknown     unknown   unknown
So, quality_1 is mostly what I want. If period X is 'low' and period X-1 is 'high', we see the reclassification to 'high' occurs, and everything else is left intact from quality. However, when quality_lag is NA, 'low' gets reclassified to NA in quality_1. This is not an issue with 'high' or 'unknown'.
That is, the first four rows of foo should look like this:

   group period quality quality_lag quality_1
1:     A      1 unknown          NA   unknown
2:     B      2     low          NA       low
3:     C      3    high          NA      high
4:     D      4     low          NA       low
Any thoughts on what is causing this?
For starters, the development version on GitHub already has an efficient lag function called shift, which can be used as either a lag or a lead (and has some additional functionality too; see ?shift).
It is also worth taking a look here, as there are a number of other new functions available in v >= 1.9.5.
So under v >= 1.9.5 we could simply do:
foo[, quality_lag := shift(quality), by = group]
Though even under v < 1.9.5 you could make use of .N instead of creating this function, in the following manner:
foo[, quality_lag2 := c(NA, quality[-.N]), by = group]
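As a quick check that the two approaches agree, here is a minimal sketch on a small made-up table. The table dt and its values are hypothetical, and the sketch assumes a data.table version that includes shift (v >= 1.9.5, as noted above).

```r
library(data.table)

# Hypothetical toy data, ordered by period within each group
dt <- data.table(group   = c('A', 'B', 'B', 'B', 'C', 'C'),
                 period  = 1:6,
                 quality = c('unknown', 'low', 'high', 'low', 'high', 'low'))

# Built-in lag (n = 1 and type = "lag" are the defaults)
dt[, lag_shift := shift(quality), by = group]

# Manual lag: drop the last element of each group and pad with NA in front
dt[, lag_manual := c(NA, quality[-.N]), by = group]

# The same function also produces leads
dt[, lead_shift := shift(quality, type = "lead"), by = group]

# Both lag columns should agree, including for the one-row group 'A'
identical(dt$lag_shift, dt$lag_manual)
```

Note that the one-row group 'A' is handled correctly by both versions: quality[-.N] is empty there, so the result is just a single NA.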
Regarding your second question, I would recommend avoiding ifelse altogether, for the many reasons specified here.
One possible alternative would be to just use simple indexing, as in:
foo[, quality_1 := quality][quality == 'low' & quality_lag == 'high', quality_1 := "high"]
This solution has a bit of overhead from calling [.data.table twice, but it will still be much more efficient and safer than the ifelse solution.
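As for what causes the NA in the first place: in R, comparing NA with a value gives NA, and TRUE & NA is NA while FALSE & NA is FALSE. So for a 'low' row with a missing lag the test itself is NA and ifelse returns NA, whereas for 'high' or 'unknown' the test is already FALSE and the original value is kept. A minimal base R sketch of this, using made-up vectors rather than the original foo:

```r
# Hypothetical values mirroring the problem rows
quality     <- c('low', 'high', 'unknown', 'low')
quality_lag <- c(NA,    NA,     NA,        'high')

# TRUE & NA is NA; FALSE & NA is FALSE
test <- quality == 'low' & quality_lag == 'high'
test
# NA FALSE FALSE TRUE

# ifelse propagates the NA from the test into the result
ifelse(test, 'high', quality)
# NA "high" "unknown" "high"

# Assigning only where the test is TRUE leaves the NA cases untouched
fixed <- quality
fixed[which(test)] <- 'high'
fixed
# "low" "high" "unknown" "high"
```

The indexing solution works for the same reason as the last step here: data.table treats NA in a logical i as FALSE, so rows with a missing lag are simply not updated.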