When I filter a dataset based on a lag() function, I lose the first row in each group (because those rows have no lag value). How can I avoid this so that I keep the first rows despite their not having any lag value?
ds <-
structure(list(mpg = c(21, 21, 21.4, 18.7, 14.3, 16.4), cyl = c(6,
6, 6, 8, 8, 8), hp = c(110, 110, 110, 175, 245, 180)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -6L), .Names = c("mpg",
"cyl", "hp"))
library(dplyr)
# example of a filter based on lag that drops the first row of each group
ds %>%
group_by(cyl) %>%
arrange(-mpg) %>%
filter(hp <= lag(hp))
Having filter(hp <= lag(hp)) excludes rows where lag(hp) is NA, because the comparison itself then evaluates to NA and filter() keeps only rows where the condition is TRUE. You can instead filter for either that inequality or for lag(hp) being NA, as is the case for the top row of each group.
I included prev = lag(hp) to make a standalone variable for the lags, just for clarity and debugging.
library(tidyverse)
ds %>%
group_by(cyl) %>%
arrange(-mpg) %>%
mutate(prev = lag(hp)) %>%
filter(hp <= prev | is.na(prev))
This yields:
# A tibble: 4 x 4
# Groups: cyl [2]
mpg cyl hp prev
<dbl> <dbl> <dbl> <dbl>
1 21.4 6. 110. NA
2 21.0 6. 110. 110.
3 21.0 6. 110. 110.
4 18.7 8. 175. NA
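To see why the first row of each group disappears without the is.na() check, it may help to look at the comparison on a plain vector. The sketch below is only an illustration: the hp values are copied from the cyl == 8 group in descending mpg order, and the names prev and keep are introduced here for the example.

```r
library(dplyr)

# hp values of the cyl == 8 rows, ordered by descending mpg
hp <- c(175, 180, 245)

# dplyr::lag() shifts the vector by one; the first element has no predecessor
prev <- lag(hp)            # NA 175 180

# A comparison against NA is NA, and filter() drops rows whose condition is NA
plain <- hp <= prev        # NA FALSE FALSE

# NA | TRUE is TRUE, so adding is.na(prev) keeps the first row
keep <- hp <= prev | is.na(prev)   # TRUE FALSE FALSE
```

Only the first element survives, which matches the single cyl == 8 row in the output above.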
Since the OP intends to use <= (less than or equal to) against the previous value, using lag with default = +Inf is sufficient: every hp value is <= +Inf, so the first row of each group always passes the filter.
Also, there is no need for a separate arrange call in the dplyr chain, as lag provides an order_by option.
Hence, the solution can be written as:
ds %>%
group_by(cyl) %>%
filter(hp <= lag(hp, default = +Inf, order_by = -mpg))
# The result below is in the original order of the data.frame, even though lag was
# calculated on the values ordered by descending mpg
# # A tibble: 4 x 3
# # Groups: cyl [2]
# mpg cyl hp
# <dbl> <dbl> <dbl>
# 1 21.0 6.00 110
# 2 21.0 6.00 110
# 3 21.4 6.00 110
# 4 18.7 8.00 175
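As a rough check that the two answers agree, the sketch below runs both filters on the same data and compares the results after sorting them the same way. The names with_is_na, with_default, and same_rows are made up for this example, and ds is rebuilt with tibble() rather than structure() purely for brevity.

```r
library(dplyr)

# Same data as in the question
ds <- tibble(
  mpg = c(21, 21, 21.4, 18.7, 14.3, 16.4),
  cyl = c(6, 6, 6, 8, 8, 8),
  hp  = c(110, 110, 110, 175, 245, 180)
)

# First answer: explicit arrange plus an is.na() escape hatch
with_is_na <- ds %>%
  group_by(cyl) %>%
  arrange(-mpg) %>%
  filter(hp <= lag(hp) | is.na(lag(hp))) %>%
  ungroup()

# Second answer: ordering and the missing first value handled inside lag()
with_default <- ds %>%
  group_by(cyl) %>%
  filter(hp <= lag(hp, default = +Inf, order_by = -mpg)) %>%
  ungroup()

# The row order differs between the two, so sort both before comparing
same_rows <- isTRUE(all.equal(
  as.data.frame(arrange(with_is_na, cyl, -mpg)),
  as.data.frame(arrange(with_default, cyl, -mpg))
))
```

Note that default = +Inf works here only because the comparison is <=. For a >= comparison the analogous default would be -Inf, and for anything more involved the is.na() form is the more general choice.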