dplyr window functions with order_by and with_order

Tags:

r

dplyr

BACKGROUND

dplyr provides window functions such as lag() and row_number(). When you want to control the order in which a window function sees its input, you can use order_by.

DATA

library(dplyr)

# value is random, so the numbers shown below will differ on each run
mydf <- data.frame(id = c("ana", "bob", "caroline",
                          "bob", "ana", "caroline"),
                   order = as.POSIXct(c("2015-01-01 18:00:00", "2015-01-01 18:05:00",
                                        "2015-01-01 19:20:00", "2015-01-01 09:07:00",
                                        "2015-01-01 08:30:00", "2015-01-01 11:11:00"),
                                        format = "%Y-%m-%d %H:%M:%S"),
                   value = runif(6, 10, 20),
                   stringsAsFactors = FALSE)

#        id               order    value
#1      ana 2015-01-01 18:00:00 19.00659
#2      bob 2015-01-01 18:05:00 13.64010
#3 caroline 2015-01-01 19:20:00 12.08506
#4      bob 2015-01-01 09:07:00 14.40996
#5      ana 2015-01-01 08:30:00 17.45165
#6 caroline 2015-01-01 11:11:00 14.50865

Suppose you want to use lag(). You can sort the data first and then compute the lag within each group.

arrange(mydf, id, order) %>%
group_by(id) %>%
mutate(check = lag(value))

#        id               order    value    check
#1      ana 2015-01-01 08:30:00 17.45165       NA
#2      ana 2015-01-01 18:00:00 19.00659 17.45165
#3      bob 2015-01-01 09:07:00 14.40996       NA
#4      bob 2015-01-01 18:05:00 13.64010 14.40996
#5 caroline 2015-01-01 11:11:00 14.50865       NA
#6 caroline 2015-01-01 19:20:00 12.08506 14.50865

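However, you can avoid arrange() by passing the order_by argument to lag(). The rows stay in their original order, and the lag is computed as if each group were sorted by order.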

group_by(mydf, id) %>%
mutate(check = lag(value, order_by = order))

#        id               order    value    check
#1      ana 2015-01-01 18:00:00 19.00659 17.45165
#2      bob 2015-01-01 18:05:00 13.64010 14.40996
#3 caroline 2015-01-01 19:20:00 12.08506 14.50865
#4      bob 2015-01-01 09:07:00 14.40996       NA
#5      ana 2015-01-01 08:30:00 17.45165       NA
#6 caroline 2015-01-01 11:11:00 14.50865       NA
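
The same idea can also be written with the order_by() helper, which wraps a window function call. The following is a minimal sketch, assuming a small data frame with fixed values so the result is reproducible; the names df, via_argument and via_helper are made up for illustration and are not part of the question.

```r
library(dplyr)

# Illustrative data with fixed values (not the random data used above)
df <- data.frame(
  id    = c("ana", "bob", "ana", "bob"),
  order = as.POSIXct(c("2015-01-01 18:00:00", "2015-01-01 18:05:00",
                       "2015-01-01 08:30:00", "2015-01-01 09:07:00"),
                     tz = "UTC"),
  value = c(19, 13, 17, 14),
  stringsAsFactors = FALSE
)

# Form 1: the order_by argument of lag()
via_argument <- df %>%
  group_by(id) %>%
  mutate(check = lag(value, order_by = order)) %>%
  ungroup()

# Form 2: the order_by() helper wrapping the lag() call
via_helper <- df %>%
  group_by(id) %>%
  mutate(check = order_by(order, lag(value))) %>%
  ungroup()

# Both forms keep the original row order and fill check with the
# previous value within each id, as if the group were sorted by order
via_argument$check
via_helper$check
```

In both cases the later row of each id receives the value of the earlier row, and the earlier row receives NA.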

EXPERIMENT

I wanted to apply the same procedure to assign a row number within each group to a new column. Using the sample data, you can do the following.

group_by(mydf, id) %>%
arrange(order) %>%
mutate(num = row_number())

#        id               order    value num
#1      ana 2015-01-01 08:30:00 17.45165   1
#2      ana 2015-01-01 18:00:00 19.00659   2
#3      bob 2015-01-01 09:07:00 14.40996   1
#4      bob 2015-01-01 18:05:00 13.64010   2
#5 caroline 2015-01-01 11:11:00 14.50865   1
#6 caroline 2015-01-01 19:20:00 12.08506   2

Can we omit the arrange() line? Following the CRAN manual, I tried the two variants below. Neither attempt was successful.

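### Not working: row_number() has no order_by argument
group_by(mydf, id) %>%
mutate(num = row_number(order_by = order))

### Not working: order_by() expects the wrapped call's first argument
### to be the vector being operated on, and row_number() has none here
group_by(mydf, id) %>%
mutate(num = order_by(order, row_number()))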

How can we achieve this?

asked Feb 16 '15 08:02

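1 Answer

I did not plan to answer my own question, but I decided to share what I found because I have seen few posts that use order_by() and, in particular, with_order(). The solution is to use with_order() instead of order_by(). The output keeps the original row order, and num gives each row's position within its id when the group is sorted by order.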

group_by(mydf, id) %>%
mutate(num = with_order(order_by = order, fun = row_number, x = order))

#        id               order    value num
#1      ana 2015-01-01 18:00:00 19.00659   2
#2      bob 2015-01-01 18:05:00 13.64010   2
#3 caroline 2015-01-01 19:20:00 12.08506   2
#4      bob 2015-01-01 09:07:00 14.40996   1
#5      ana 2015-01-01 08:30:00 17.45165   1
#6 caroline 2015-01-01 11:11:00 14.50865   1
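
As an aside, row_number() also accepts a vector to rank, so passing the ordering column directly may be a shorter way to get the same numbering. The sketch below assumes a small data frame with fixed, tie-free timestamps; df, by_with_order and by_rank are illustrative names that do not appear in the question.

```r
library(dplyr)

# Illustrative data with fixed values and no tied timestamps
df <- data.frame(
  id    = c("ana", "bob", "ana", "bob"),
  order = as.POSIXct(c("2015-01-01 18:00:00", "2015-01-01 18:05:00",
                       "2015-01-01 08:30:00", "2015-01-01 09:07:00"),
                     tz = "UTC"),
  stringsAsFactors = FALSE
)

# The approach from the answer: with_order() around row_number()
by_with_order <- df %>%
  group_by(id) %>%
  mutate(num = with_order(order_by = order, fun = row_number, x = order)) %>%
  ungroup()

# Alternative: let row_number() rank the ordering column itself
by_rank <- df %>%
  group_by(id) %>%
  mutate(num = row_number(order)) %>%
  ungroup()

by_with_order$num
by_rank$num
```

With tied timestamps the two forms could break ties differently, so this equivalence should only be relied on when the ordering column is unique within each group.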

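I wanted to see whether the two approaches differ in speed, so I set up the benchmark below on 1.2 million rows.

library(microbenchmark)

mydf2 <- data.frame(id = rep(c("ana", "bob", "caroline",
                               "bob", "ana", "caroline"), times = 200000),
                    order = seq(as.POSIXct("2015-03-01 18:00:00", format = "%Y-%m-%d %H:%M:%S"),
                                as.POSIXct("2015-01-01 18:00:00", format = "%Y-%m-%d %H:%M:%S"),
                                length.out = 1200000),
                    value = runif(1200000, 10, 20),
                    stringsAsFactors = FALSE)

jazz1 <- function() {group_by(mydf2, id) %>%
                     arrange(order) %>%
                     mutate(num = row_number())}

jazz2 <- function() {group_by(mydf2, id) %>%
                     mutate(num = with_order(order_by = order, fun = row_number, x = order))}

# The functions must be called (note the parentheses). Passing the bare names
# jazz1 and jazz2 only times the lookup of the function objects.
# times is kept small because each call processes 1.2 million rows.
res <- microbenchmark(jazz1(), jazz2(), times = 10L)
res

# The figures below came from microbenchmark(jazz1, jazz2, times = 1000000L),
# that is, without the parentheses. Neither pipeline was actually run, which is
# why both rows are nearly identical and in nanoseconds. They say nothing about
# the relative speed of the two approaches.

#Unit: nanoseconds
#  expr min lq     mean median uq   max neval cld
# jazz1  32 36 47.17647     38 47 12308 1e+06   a
# jazz2  32 36 47.02902     38 47 12402 1e+06   a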
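
Before comparing timings, it may be worth confirming that the two pipelines produce the same numbering. The following is a rough sketch on a smaller, seeded data set; small, with_arrange, with_helper and same_numbering are hypothetical names introduced here, and system.time() from base R is used only as a quick single-run timer, not as a substitute for microbenchmark.

```r
library(dplyr)

set.seed(1)
n <- 6000

# Smaller illustrative data set; timestamps are sampled without
# replacement so there are no ties within a group
small <- data.frame(
  id    = rep(c("ana", "bob", "caroline"), length.out = n),
  order = as.POSIXct("2015-01-01", tz = "UTC") + sample.int(n * 10, n),
  value = runif(n, 10, 20),
  stringsAsFactors = FALSE
)

# Approach 1: sort first, then number the rows within each group
with_arrange <- function(d) {
  d %>% group_by(id) %>% arrange(order) %>%
    mutate(num = row_number()) %>% ungroup()
}

# Approach 2: keep the row order and let with_order() handle the sorting
with_helper <- function(d) {
  d %>% group_by(id) %>%
    mutate(num = with_order(order_by = order, fun = row_number, x = order)) %>%
    ungroup()
}

# Put both results in the same row order before comparing the numbering
a <- with_arrange(small) %>% arrange(id, order)
b <- with_helper(small) %>% arrange(id, order)
same_numbering <- identical(a$num, b$num)
same_numbering

# Quick single-run timings; results depend on the machine
system.time(with_arrange(small))
system.time(with_helper(small))
```

If same_numbering is TRUE, any timing difference reflects how the ordering is done rather than a difference in the result.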

answered Oct 19 '22 14:10

jazzurro
