BACKGROUD
dplyr has window functions. When you want to control the order of window functions,
you can use order_by
.
DATA
mydf <- data.frame(id = c("ana", "bob", "caroline",
"bob", "ana", "caroline"),
order = as.POSIXct(c("2015-01-01 18:00:00", "2015-01-01 18:05:00",
"2015-01-01 19:20:00", "2015-01-01 09:07:00",
"2015-01-01 08:30:00", "2015-01-01 11:11:00"),
format = "%Y-%m-%d %H:%M:%S"),
value = runif(6, 10, 20),
stringsAsFactors = FALSE)
# id order value
#1 ana 2015-01-01 18:00:00 19.00659
#2 bob 2015-01-01 18:05:00 13.64010
#3 caroline 2015-01-01 19:20:00 12.08506
#4 bob 2015-01-01 09:07:00 14.40996
#5 ana 2015-01-01 08:30:00 17.45165
#6 caroline 2015-01-01 11:11:00 14.50865
Suppose you want to use lag()
, you can do the following.
arrange(mydf, id, order) %>%
group_by(id) %>%
mutate(check = lag(value))
# id order value check
#1 ana 2015-01-01 08:30:00 17.45165 NA
#2 ana 2015-01-01 18:00:00 19.00659 17.45165
#3 bob 2015-01-01 09:07:00 14.40996 NA
#4 bob 2015-01-01 18:05:00 13.64010 14.40996
#5 caroline 2015-01-01 11:11:00 14.50865 NA
#6 caroline 2015-01-01 19:20:00 12.08506 14.50865
However, you can avoid using arrange()
with order_by()
.
group_by(mydf, id) %>%
mutate(check = lag(value, order_by = order))
# id order value check
#1 ana 2015-01-01 18:00:00 19.00659 17.45165
#2 bob 2015-01-01 18:05:00 13.64010 14.40996
#3 caroline 2015-01-01 19:20:00 12.08506 14.50865
#4 bob 2015-01-01 09:07:00 14.40996 NA
#5 ana 2015-01-01 08:30:00 17.45165 NA
#6 caroline 2015-01-01 11:11:00 14.50865 NA
EXPERIMENT
I wanted to apply the same procedure to the case in which I wanted to assign row number to a new column. Using the sample data, you can do the folowing.
group_by(mydf, id) %>%
arrange(order) %>%
mutate(num = row_number())
# id order value num
#1 ana 2015-01-01 08:30:00 17.45165 1
#2 ana 2015-01-01 18:00:00 19.00659 2
#3 bob 2015-01-01 09:07:00 14.40996 1
#4 bob 2015-01-01 18:05:00 13.64010 2
#5 caroline 2015-01-01 11:11:00 14.50865 1
#6 caroline 2015-01-01 19:20:00 12.08506 2
Can we omit the arrange line? Seeing the CRAN manual, I did the following. Both attempts were not successful.
### Not working
group_by(mydf, id) %>%
mutate(num = row_number(order_by = order))
### Not working
group_by(mydf, id) %>%
mutate(num = order_by(order, row_number()))
How can we achieve this?
None of the built in functions have an order_by argument so dplyr provides a helper: order_by (). You give it the variable you want to order by, and then the call to the window function:
How To Sort or Reorder Rows in Descending Order with dplyr’s arrange ()? By default, dplyr’s arrange () sorts in ascending order. We can sort by a variable in descending order using desc () function on the variable we want to sort by.
order_by: A helper function for ordering window function output Description. This function makes it possible to control the ordering of window functions in R that don't have a specific ordering parameter. When translated to SQL it will modify the order clause of the OVER function. Usage order_by(order_by, call) Arguments
We can sort by a variable in descending order using desc () function on the variable we want to sort by. For example, to sort the dataframe by body_mass_g in descending order we use How To Sort a Dataframe by Two Variables? With dplyr’s arrange () function we can sort by more than one variable.
I did not mean to answer this question by myself. But, I decided to share
what I found given I have not seen many posts using order_by
and particularly
with_order
. My answer was to use with_order()
instead of order_by()
.
group_by(mydf, id) %>%
mutate(num = with_order(order_by = order, fun = row_number, x = order))
# id order value num
#1 ana 2015-01-01 18:00:00 19.00659 2
#2 bob 2015-01-01 18:05:00 13.64010 2
#3 caroline 2015-01-01 19:20:00 12.08506 2
#4 bob 2015-01-01 09:07:00 14.40996 1
#5 ana 2015-01-01 08:30:00 17.45165 1
#6 caroline 2015-01-01 11:11:00 14.50865 1
I wanted to see if there would be any difference between the two approaches in terms of speed. It seems that they are pretty similar in this case.
library(microbenchmark)
mydf2 <- data.frame(id = rep(c("ana", "bob", "caroline",
"bob", "ana", "caroline"), times = 200000),
order = seq(as.POSIXct("2015-03-01 18:00:00", format = "%Y-%m-%d %H:%M:%S"),
as.POSIXct("2015-01-01 18:00:00", format = "%Y-%m-%d %H:%M:%S"),
length.out = 1200000),
value = runif(1200000, 10, 20),
stringsAsFactors = FALSE)
jazz1 <- function() {group_by(mydf2, id) %>%
arrange(order) %>%
mutate(num = row_number())}
jazz2 <- function() {group_by(mydf2, id) %>%
mutate(num = with_order(order_by = order, fun = row_number, x = order))}
res <- microbenchmark(jazz1, jazz2, times = 1000000L)
res
#Unit: nanoseconds
# expr min lq mean median uq max neval cld
# jazz1 32 36 47.17647 38 47 12308 1e+06 a
# jazz2 32 36 47.02902 38 47 12402 1e+06 a
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With