
Filter based on conditional criteria in R

Tags:

r

filter

subset

I have a data frame in R that I would like to subset based on a specific criterion, a sort of conditional filter. The data frame is a panel dataset of daily values for 2004-2014, with one observation per day. Each year currently has 366 days. I would like to subset the data so that only the leap years retain the 366th day. There are three leap years in that range: 2004, 2008, and 2012. I have separate columns for the year and the day of the year. In other words, I need a script that drops the 366th day from every year except 2004, 2008, and 2012.

I've managed to accomplish this the following way: I pasted my year and day columns together (e.g. "2006-366") and used dplyr's filter to drop each unwanted combination (2005-366, 2006-366, 2007-366, 2009-366, 2010-366, 2011-366, 2013-366, 2014-366). This, however, is an awfully crude method, and I was hoping someone could point me in the right direction. Here's some reproducible data along with the workflow I used.

 library(dplyr)

 #Create DF: 11 years, each padded to 366 days
 year <- rep(2004:2014, each=366)
 day <- rep(1:366, times=11)
 df <- data.frame(day, year)

 #My crude method: build a "year-day" key and drop each unwanted key
 df$reduc <- paste(df$year, df$day, sep="-")

 df <- df %>%
    filter(reduc!="2005-366") %>%
    filter(reduc!="2006-366") %>%
    filter(reduc!="2007-366") %>%
    filter(reduc!="2009-366") %>%
    filter(reduc!="2010-366") %>%
    filter(reduc!="2011-366") %>%
    filter(reduc!="2013-366") %>%
    filter(reduc!="2014-366")
Cyrus Mohammadian asked Jul 11 '16 20:07


2 Answers

Set up data:

df <- expand.grid(year=2004:2014, day=1:366)
nrow(df) ## 4026

Now exclude cases where (year is not divisible by 4) AND (day equals 366). Identifying non-leap years would be trickier if your data set included century years such as 1900 or 2100, which are divisible by 4 but are not leap years.

library(dplyr)
df2 <- df %>% filter(!(year %% 4 > 0 & day==366))
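The `year %% 4` test is sufficient for 2004-2014. If the data ever spanned century years, the full Gregorian rule could be applied instead. The following is a minimal sketch under that assumption; `is_leap_year()` is a hypothetical helper name introduced here, not a base R or dplyr function.

```r
library(dplyr)

# Hypothetical helper: full Gregorian leap-year rule.
# Divisible by 4, except century years, unless also divisible by 400.
is_leap_year <- function(year) {
  (year %% 4 == 0 & year %% 100 != 0) | (year %% 400 == 0)
}

df <- expand.grid(year = 2004:2014, day = 1:366)

# Keep every row except day 366 of a non-leap year
df2 <- df %>% filter(!(day == 366 & !is_leap_year(year)))

nrow(df2)  # 4018: the 4026 original rows minus one row for each of the 8 non-leap years
```

Keeping the rule in a named function means the filter itself still reads as "drop day 366 unless the year is a leap year".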
Ben Bolker answered Oct 15 '22 10:10


You should derive the correct Date values for your dates. This can be done by building the January 1st string representation for each row's year, coercing to Date type, and then adding the day (minus 1) to the Date value.

df$date <- as.Date(paste0(df$year,'-01-01'))+(df$day-1L)

We will then be able to pull out the year from the Date value and check it against the input year. If they fail to match, then we know the year/day combination was invalid, and we can excise it from the data. This works because invalid leap days will translate into January 1st of the following year under the above derivation method.

df[df$year==as.integer(strftime(df$date,'%Y')),]
##      day year       date
## 1      1 2004 2004-01-01
## ...
## 366  366 2004 2004-12-31
## 367    1 2005 2005-01-01
## ...
## 731  365 2005 2005-12-31
## 733    1 2006 2006-01-01
## ...
## 1097 365 2006 2006-12-31
## 1099   1 2007 2007-01-01
## ...
## 1463 365 2007 2007-12-31
## 1465   1 2008 2008-01-01
## ...
## 1830 366 2008 2008-12-31
## 1831   1 2009 2009-01-01
## ...
## 2195 365 2009 2009-12-31
## 2197   1 2010 2010-01-01
## ...
## 2561 365 2010 2010-12-31
## 2563   1 2011 2011-01-01
## ...
## 2927 365 2011 2011-12-31
## 2929   1 2012 2012-01-01
## ...
## 3294 366 2012 2012-12-31
## 3295   1 2013 2013-01-01
## ...
## 3659 365 2013 2013-12-31
## 3661   1 2014 2014-01-01
## ...
## 4025 365 2014 2014-12-31
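To see why the year comparison works, it may help to look at day 366 in isolation first. The sketch below uses only base R and rebuilds the question's data frame so that it runs on its own; the name `kept` is illustrative, and `format()` is used here in place of `strftime()`.

```r
# Day 366 of a leap year stays inside that year...
as.Date("2004-01-01") + (366L - 1L)  # "2004-12-31"
# ...while day 366 of a non-leap year rolls over into the next year
as.Date("2005-01-01") + (366L - 1L)  # "2006-01-01"

# Rebuild the question's data: 11 years, each padded to 366 days
df <- data.frame(day  = rep(1:366, times = 11),
                 year = rep(2004:2014, each = 366))

# Derive the calendar date, then keep rows whose date is still in the stated year
df$date <- as.Date(paste0(df$year, "-01-01")) + (df$day - 1L)
kept <- df[df$year == as.integer(format(df$date, "%Y")), ]

nrow(kept)                          # 4018
unique(kept$year[kept$day == 366])  # 2004 2008 2012
```

Because the calendar logic is delegated to `Date` arithmetic, this check would also treat century years correctly without any explicit leap-year rule.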
bgoldst answered Oct 15 '22 09:10