
Using the result of summarise (dplyr) to mutate the original dataframe

I have a rather large data frame with a column of POSIXct datetimes (about 10 years of hourly data). I would like to flag every row whose day falls within a daylight saving period. For example, if daylight saving time starts at '2000-04-02 03:00:00' (DOY=93), the two earlier hours of DOY=93 should be flagged as well. Although I am new to dplyr, I would like to use this package as much as possible and avoid for-loops.

For example:

library(lubridate)
sd = ymd('2000-01-01',tz="America/Denver")
ed = ymd('2005-12-31',tz="America/Denver")
span = data.frame(date=seq(from=sd,to=ed, by="hour"))
span$YEAR = year(span$date)
span$DOY = yday(span$date)
span$DLS = dst(span$date)

To find the first and last day of each year on which daylight saving time applies, I use dplyr:

library(dplyr)
limits = span %>% group_by(YEAR) %>% summarise(minDOY=min(DOY[DLS]),maxDOY=max(DOY[DLS]))

That gives

      YEAR minDOY maxDOY
    1 2000     93    303
    2 2001     91    301
    3 2002     97    300
    4 2003     96    299
    5 2004     95    305
    6 2005     93    303

Now I would like to merge these results back into the span data frame without using an inefficient for-loop.

SOLUTION 1

With the help of @aosmith, the problem can be tackled with just two commands (avoiding the inner_join used in 'solution 2'):

 limits = span %>% group_by(YEAR) %>% mutate(minDOY=min(DOY[DLS]),maxDOY=max(DOY[DLS]),CHECK=FALSE)

 limits$CHECK[(limits$DOY >= limits$minDOY) & (limits$DOY <= limits$maxDOY) ] = TRUE
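
The two commands above can also be folded into a single grouped mutate, because columns created earlier in a mutate() call can be used later in the same call. The following is only a minimal sketch: `toy` is a hypothetical stand-in for `span` with two rows per day instead of 24, and its DLS values are made up for illustration.

```r
library(dplyr)

# Hypothetical stand-in for `span`: two rows ("hours") per day, two years.
# In 2000 the DST flag turns on in the middle of day 2 and off in the middle of day 4;
# in 2001 it turns on in the middle of day 3 and stays on through day 5.
toy <- data.frame(
  YEAR = rep(c(2000, 2001), each = 10),
  DOY  = rep(rep(1:5, each = 2), times = 2),
  DLS  = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE,
           FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE)
)

flagged <- toy %>%
  group_by(YEAR) %>%
  mutate(
    minDOY = min(DOY[DLS]),                  # first DST day of the year
    maxDOY = max(DOY[DLS]),                  # last DST day of the year
    CHECK  = DOY >= minDOY & DOY <= maxDOY   # flag every row of those days
  ) %>%
  ungroup()
```

Rows that belong to the first or last DST day are flagged even when their own DLS value is FALSE, which is the behaviour asked for in the question.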

SOLUTION 2

With the help of @beetroot and @matthew-plourde, the problem has been solved: an inner join with the original data frame was missing:

limits = span %>% group_by(YEAR) %>% summarise(minDOY=min(DOY[DLS]),maxDOY=max(DOY[DLS])) %>% inner_join(span, by='YEAR')

Then I just added a new column (CHECK) and set it to TRUE for the days within the daylight-saving period:

limits$CHECK = FALSE
limits$CHECK[(limits$DOY >= limits$minDOY) & (limits$DOY <= limits$maxDOY) ] = TRUE
Fabio asked Aug 12 '14 14:08




3 Answers

As @beetroot points out in the comments, you can accomplish this with a join:

limits = span %>% 
   group_by(YEAR) %>% 
   summarise(minDOY=min(DOY[DLS]),maxDOY=max(DOY[DLS])) %>%
   inner_join(span, by='YEAR')
#    YEAR minDOY maxDOY                date DOY   DLS
# 1  2000     93    303 2000-01-01 00:00:00   1 FALSE
# 2  2000     93    303 2000-01-01 01:00:00   1 FALSE
# 3  2000     93    303 2000-01-01 02:00:00   1 FALSE
# 4  2000     93    303 2000-01-01 03:00:00   1 FALSE
# 5  2000     93    303 2000-01-01 04:00:00   1 FALSE
# 6  2000     93    303 2000-01-01 05:00:00   1 FALSE
# 7  2000     93    303 2000-01-01 06:00:00   1 FALSE
# 8  2000     93    303 2000-01-01 07:00:00   1 FALSE
# 9  2000     93    303 2000-01-01 08:00:00   1 FALSE
# 10 2000     93    303 2000-01-01 09:00:00   1 FALSE
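
Once the per-year limits sit next to every row, the flag itself can be computed in the same pipeline. The sketch below assumes a small hypothetical data frame `toy` in place of `span` (two rows per day, made-up DLS values) and uses left_join so that the original row order is kept.

```r
library(dplyr)

# Hypothetical stand-in for `span`: one year, two rows ("hours") per day.
# The DST flag turns on in the middle of day 2 and off in the middle of day 4.
toy <- data.frame(
  YEAR = rep(2000, 10),
  DOY  = rep(1:5, each = 2),
  DLS  = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
)

# One row per year with the first and last DST day.
limits <- toy %>%
  group_by(YEAR) %>%
  summarise(minDOY = min(DOY[DLS]), maxDOY = max(DOY[DLS]))

# Attach the limits to every row, then flag whole days inside the range.
flagged <- toy %>%
  left_join(limits, by = "YEAR") %>%
  mutate(CHECK = DOY >= minDOY & DOY <= maxDOY)
```

Here `flagged` has one row per row of `toy`, with CHECK set to TRUE for all rows of days 2 to 4.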
Matthew Plourde answered Oct 20 '22 08:10



The best solution to get the job done, as suggested by @aosmith, is:

limits = span %>% group_by(YEAR) %>% mutate(minDOY=min(DOY[DLS]),maxDOY=max(DOY[DLS]),CHECK=FALSE)

limits$CHECK[(limits$DOY >= limits$minDOY) & (limits$DOY <= limits$maxDOY) ] = TRUE

The use of the ave function is a good choice, but I personally prefer to stick to the 'dplyr' package.

Fabio answered Oct 20 '22 10:10



dplyr is a great tool, but in this case I'm not sure it's the best for the job. This accomplishes your task:

span$CHECK <- ave(dst(span$date), as.Date(span$date, tz = tz(span$date)), FUN = any)

I think ave is a terrible name for this function, but if you can remember it exists, it's often quite useful when you want to join a summary back to the data.frame it came from.
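
To see what ave does here without the datetime machinery, a small base R sketch may help. The data frame `toy` and its `day` column are hypothetical; the point is only that ave applies FUN within each group and writes the result back to every row of that group.

```r
# Hypothetical data: two rows ("hours") per day, DST turns on in the middle of d2.
toy <- data.frame(
  day = rep(c("d1", "d2", "d3"), each = 2),
  DLS = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE)
)

# any() is evaluated once per day, and its single value is recycled
# across all rows of that day.
toy$CHECK <- ave(toy$DLS, toy$day, FUN = any)
```

Both rows of d2 end up flagged, although only one of them has DLS set to TRUE.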

oropendola answered Oct 20 '22 08:10
