I have a dataframe of time series data with daily observations of temperatures. I need to create a dummy variable that counts each day that has temperature above a threshold of 5C. This would be easy in itself, but an additional condition exists: counting starts only after ten consecutive days above the threshold occurs. Here's an example dataframe:
df <- data.frame(date = seq(365),
temp = -30 + 0.65*seq(365) - 0.0018*seq(365)^2 + rnorm(365))
I think I got it done, but with too many loops for my liking. This is what I did:
df$dummyUnconditional <- 0
df$dummyHead <- 0
df$dummyTail <- 0
for(i in 1:nrow(df)){
if(df$temp[i] > 5){
df$dummyUnconditional[i] <- 1
}
}
for(i in 1:(nrow(df)-9)){
if(sum(df$dummyUnconditional[i:(i+9)]) == 10){
df$dummyHead[i] <- 1
}
}
for(i in 9:nrow(df)){
if(sum(df$dummyUnconditional[(i-9):i]) == 10){
df$dummyTail[i] <- 1
}
}
df$dummyConditional <- ifelse(df$dummyHead == 1 | df$dummyTail == 1, 1, 0)
Could anyone suggest simpler ways for doing this?
This recoding is called “dummy coding” and leads to the creation of a table called contrast matrix. This is done automatically by statistical software, such as R. Here, you'll learn how to build and interpret a linear regression model with categorical predictor variables.
In research design, a dummy variable is often used to distinguish different treatment groups. In the simplest case, we would use a 0,1 dummy variable where a person is given a value of 0 if they are in the control group or a 1 if they are in the treated group.
Yes, coefficients of dummy variables can be more than one or less than zero. Remember that you can interpret that coefficient as the mean change in your response (dependent) variable when the dummy changes from 0 to 1, holding all other variables constant (i.e. ceteris paribus).
Here's a base R option using rle
:
df$dummy <- with(rle(df$temp > 5), rep(as.integer(values & lengths >= 10), lengths))
Some explanation: The task is a classic use case for the run length encoding (rle
) function, imo. We first check if the value of temp
is greater than 5 (creating a logical vector) and apply rle
on that vector resulting in:
> rle(df$temp > 5)
#Run Length Encoding
# lengths: int [1:7] 66 1 1 225 2 1 69
# values : logi [1:7] FALSE TRUE FALSE TRUE FALSE TRUE ...
Now we want to find those cases where the values
is TRUE
(i.e. temp is greater than 5) and where at the same time the lengths
is greater than 10 (i.e. at least ten consecutive temp
values are greater than 5). We do this by running:
values & lengths >= 10
And finally, since we want to return a vector of the same lengths as nrow(df)
, we use rep(..., lengths)
and as.integer
in order to return 1/0 instead of TRUE
/FALSE
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With