I would like to mutate a dataframe by applying a function which calls out to another dataframe. I can acheive this in a few different ways, but would like to know how to do this 'properly'.
Here is an example of what I'm trying to do. I have a dataframe with some start times, and a second with some timed observations. I would like to return a dataframe with the start times, and the number of observations that occur within some window after the start time. e.g.
set.seed(1337)
df1 <- data.frame(id=LETTERS[1:3], start_time=1:3*10)
df2 <- data.frame(time=runif(100)*100)
lapply(df1$start_time, function(s) sum(df2$time>s & df2$time<(s+15)))
The best I've got so far with dplyr is the following (but this loses the identity variables):
df1 %>%
rowwise() %>%
do(count = filter(df2, time>.$start_time, time < (.$start_time + 15))) %>%
mutate(n=nrow(count))
output:
Source: local data frame [3 x 2]
Groups: <by row>
# A tibble: 3 × 2
count n
<list> <int>
1 <data.frame [17 × 1]> 17
2 <data.frame [18 × 1]> 18
3 <data.frame [10 × 1]> 10
I was expecting to be able to do this:
df1 <- data.frame(id=LETTERS[1:3], start_time=1:3*10)
df2 <- data.frame(time=runif(100)*100)
df1 %>%
group_by(id) %>%
mutate(count = nrow(filter(df2, time>start_time, time<(start_time+15))))
but this returns the error:
Error: comparison (6) is possible only for atomic and list types
What is the dplyr way of doing this?
mutate() adds new variables and preserves existing ones; transmute() adds new variables and drops existing ones. New variables overwrite existing variables of the same name.
The group_by() function in R is from dplyr package that is used to group rows by column values in the DataFrame, It is similar to GROUP BY clause in SQL. R dplyr groupby is used to collect identical data into groups on DataFrame and perform aggregate functions on the grouped data.
The dplyr function arrange() can be used to reorder (or sort) rows by one or more variables.
Here is one option with data.table
where we can use the non-equi
joins
library(data.table)#1.9.7+
setDT(df1)[, start_timeNew := start_time + 15]
setDT(df2)[df1, .(id, .N), on = .(time > start_time, time < start_timeNew),
by = .EACHI][, c('id', 'N'), with = FALSE]
# id N
#1: A 17
#2: B 18
#3: C 10
which gives the same count as in the OP's base R
method
sapply(df1$start_time, function(s) sum(df2$time>s & df2$time<(s+15)))
#[1] 17 18 10
If we need the 'id' variable also as output in dplyr
, we can modify the OP's code
df1 %>%
rowwise() %>%
do(data.frame(., count = filter(df2, time>.$start_time,
time < (.$start_time + 15)))) %>%
group_by(id) %>%
summarise(n = n())
# id n
# <fctr> <int>
#1 A 17
#2 B 18
#3 C 10
Or another option is map
from purrr
with dplyr
library(purrr)
df1 %>%
split(.$id) %>%
map_df(~mutate(., N = sum(df2$time >start_time & df2$time < start_time + 15))) %>%
select(-start_time)
# id N
#1 A 17
#2 B 18
#3 C 10
Another slightly different approach using dplyr
:
result <- df1 %>% group_by(id) %>%
summarise(count = length(which(df2$time > start_time &
df2$time < (start_time+15))))
print(result)
### A tibble: 3 x 2
## id count
## <fctr> <int>
##1 A 17
##2 B 18
##3 C 10
I believe you can use length
and which
to count the number of occurrences for which your condition is true for each id
in df1
. Then, group by id
and use this to summarise
.
If there are possibly more that one start_time
per id
, then you can use the same function but rowwise
and with mutate
:
result <- df1 %>% rowwise() %>%
mutate(count = length(which(df2$time > start_time &
df2$time < (start_time+15))))
print(result)
##Source: local data frame [3 x 3]
##Groups: <by row>
##
### A tibble: 3 x 3
## id start_time count
## <fctr> <dbl> <int>
##1 A 10 17
##2 B 20 18
##3 C 30 10
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With