I have these three intervals defined:
YEAR_1 <- interval(ymd('2002-09-01'), ymd('2003-08-31'))
YEAR_2 <- interval(ymd('2003-09-01'), ymd('2004-08-31'))
YEAR_3 <- interval(ymd('2004-09-01'), ymd('2005-08-31'))
(in real life, I have 50 of these)
I have a dataframe (called df) with a column full of lubridate-formatted dates.
I'd like to append a new column to df which has the appropriate value YEAR_n, depending on which interval the date falls within.
Something like:
df$YR <- ifelse(df$DATE %within% YEAR_1, 1, NA)
but I'm not sure how to proceed. I think I need to use an apply somehow?
Here's my dataframe:
structure(c(1055289600, 1092182400, 1086220800, 1074556800, 1109289600,
1041897600, 1069200000, 1047427200, 1072656000, 1048636800, 1092873600,
1090195200, 1051574400, 1052179200, 1130371200, 1242777600, 1140652800,
1137974400, 1045526400, 1111104000, 1073952000, 1052870400, 1087948800,
1053993600, 1039564800, 1141603200, 1074038400, 1105315200, 1060560000,
1072051200, 1046217600, 1107129600, 1088553600, 1071619200, 1115596800,
1050364800, 1147046400, 1083628800, 1056412800, 1159747200, 1087257600,
1201478400, 1120521600, 1066176000, 1034553600, 1057622400, 1078876800,
1010880000, 1133913600, 1098230400, 1170806400, 1037318400, 1070409600,
1091577600, 1057708800, 1182556800, 1091059200, 1058227200, 1061337600,
1034121600, 1067644800, 1039478400, 1022198400, 1063065600, 1096329600,
1049760000, 1081728000, 1016150400, 1029801600, 1059350400, 1087257600,
1181692800, 1310947200, 1125446400, 1057104000, NA, 1085529600,
1037664000, 1091577600, 1080518400, 1110758400, 1092787200, 1094601600,
1169424000, 1232582400, 1058918400, 1021420800, 1133136000, 1030320000,
1060732800, 1035244800, 1090800000, 1129161600, 1055808000, 1060646400,
1028678400, 1075852800, 1144627200, 1111363200, 1070236800), class = c("POSIXct",
"POSIXt"), tzone = "UTC")
Everybody has their favourite tool for this; mine happens to be data.table because of what it refers to as its dt[i, j, by] logic.
library(data.table)
# pt is assumed to hold the POSIXct vector from the question
dt <- data.table(date = as.IDate(pt))
dt[, YR := 0.0 ] # I am using a numeric for year here...
dt[ date >= as.IDate("2002-09-01") & date <= as.IDate("2003-08-31"), YR := 1 ]
dt[ date >= as.IDate("2003-09-01") & date <= as.IDate("2004-08-31"), YR := 2 ]
dt[ date >= as.IDate("2004-09-01") & date <= as.IDate("2005-08-31"), YR := 3 ]
I create a data.table object, converting your times to dates for later comparison. I then set up a new column, defaulting to zero.
We then execute three conditional statements: for each of the three intervals (which I just create by hand using the endpoints), we set the YR value to 1, 2 or 3.
This does have the desired effect as we can see from
R> print(dt, topn=5, nrows=10)
date YR
1: 2003-06-11 1
2: 2004-08-11 2
3: 2004-06-03 2
4: 2004-01-20 2
5: 2005-02-25 3
---
96: 2002-08-07 0
97: 2004-02-04 2
98: 2006-04-10 0
99: 2005-03-21 3
100: 2003-12-01 2
R> table(dt[, YR])
0 1 2 3
26 31 31 12
R>
One could have done this also simply by computing date differences and truncating down, but it is also nice to be a little explicit at times.
Edit: A more generic form just uses arithmetic on the dates:
R> dt[, YR2 := trunc(as.numeric(difftime(as.Date(date),
+ as.Date("2001-09-01"),
+ units="days"))/365.25)]
R> table(dt[, YR2])
0 1 2 3 4 5 6 7 9
7 31 31 12 9 5 1 2 1
R>
This does the job in one line.
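Dividing by 365.25 only approximates the year boundaries: 2003-09-01 is 730 days after 2001-09-01, and 730 / 365.25 truncates to 1 rather than 2. If exact boundaries matter, one alternative is base R's findInterval() with explicit breakpoints. The following is a minimal sketch, not part of the answer above; it assumes every interval starts on 1 September, and the names `dates`, `breaks` and `n_years` are made up for illustration.

```r
# A few example dates (taken from the printed output above) plus one NA
dates <- as.Date(c("2003-06-11", "2004-08-11", "2005-02-25", "2002-08-07", NA))

# Assumption: each interval runs from 1 September to 31 August,
# and the first interval of interest starts on 2002-09-01
n_years <- 3
breaks <- seq(as.Date("2002-09-01"), by = "year", length.out = n_years + 1)

# findInterval() gives 0 for dates before the first break and
# n_years + 1 for dates on or after the last break
yr <- findInterval(as.numeric(dates), as.numeric(breaks))

# Dates outside all intervals become NA
yr[which(yr < 1 | yr > n_years)] <- NA
yr
```

With 50 intervals, only `n_years` would need to change, since the breakpoints are generated rather than typed by hand.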
You can use walk from package purrr for this:
purrr::walk(1:3, ~(df$Year[as.POSIXlt(df$DATE) %within% get(paste0("YEAR_", .))] <<- .))
Or you could write a loop to improve readability (unless loops are taboo for you):
df$YR <- NA
for (i in 1:3) {
  interval <- get(paste0("YEAR_", i))
  index <- which(as.POSIXlt(df$DATE) %within% interval)
  df$YR[index] <- i
}
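With 50 intervals, looking each one up by name with get(paste0(...)) gets unwieldy. A possible variation on the loop is to keep the intervals in a list and build them programmatically. This is only a sketch: the small `df` and the names `starts` and `year_intervals` are hypothetical, and it assumes each interval runs from 1 September to the following 31 August.

```r
library(lubridate)

# Hypothetical stand-in for the real data frame
df <- data.frame(
  DATE = ymd(c("2003-06-11", "2004-08-11", "2005-02-25", "2002-08-07"), tz = "UTC")
)

# Build the intervals in a list instead of separate YEAR_n variables
starts <- ymd("2002-09-01", tz = "UTC") + years(0:2)
year_intervals <- lapply(seq_along(starts), function(i) {
  interval(starts[i], starts[i] + years(1) - days(1))
})

# Same loop as above, indexing the list directly
df$YR <- NA_integer_
for (i in seq_along(year_intervals)) {
  index <- which(df$DATE %within% year_intervals[[i]])
  df$YR[index] <- i
}
df
```

Dates that fall in none of the intervals keep their initial NA.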
With lubridate and mapply:
library(lubridate)
# dates <- ...   (your data here, i.e. the POSIXct vector from the question)
# no idea how you generated these, so let's just copy them
YEAR_1 <- interval(ymd('2002-09-01'), ymd('2003-08-31'))
YEAR_2 <- interval(ymd('2003-09-01'), ymd('2004-08-31'))
YEAR_3 <- interval(ymd('2004-09-01'), ymd('2005-08-31'))
# this should scale nicely
sapply(c(YEAR_1, YEAR_2, YEAR_3), function(x) { mapply(`%within%`, dates, x) })
The result is a matrix with one column per interval:
[,1] [,2] [,3]
[1,] TRUE FALSE FALSE
[2,] FALSE TRUE FALSE
[3,] FALSE TRUE FALSE
[4,] FALSE TRUE FALSE
... etc. (100 rows in your example data)
There might be a nicer way to code that with purrr, but I am too new to purrr to see it.
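To get from the logical matrix to the single YR column the question asks for, one option is to take, for each row, the index of the column that is TRUE. The sketch below uses a small hand-made matrix `m` (hypothetical, standing in for the sapply() result) and base R only.

```r
# Hypothetical logical matrix shaped like the sapply() result:
# one row per date, one column per interval
m <- matrix(c(TRUE,  FALSE, FALSE,
              FALSE, TRUE,  FALSE,
              FALSE, FALSE, TRUE,
              FALSE, FALSE, FALSE),   # last date falls in no interval
            ncol = 3, byrow = TRUE)

# Index of the first matching column per row, or NA when nothing matches
yr <- apply(m, 1, function(row) {
  hit <- which(row)
  if (length(hit) > 0) hit[1] else NA_integer_
})
yr
```

Because which() drops NA values, a row that is entirely NA (for example from a missing date) would also end up as NA here.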