Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

dplyr mutate calling another dataframe

Tags:

r

dplyr

I would like to mutate a dataframe by applying a function which calls out to another dataframe. I can acheive this in a few different ways, but would like to know how to do this 'properly'.

Here is an example of what I'm trying to do. I have a dataframe with some start times, and a second with some timed observations. I would like to return a dataframe with the start times, and the number of observations that occur within some window after the start time. e.g.

set.seed(1337)
df1 <- data.frame(id=LETTERS[1:3], start_time=1:3*10)
df2 <- data.frame(time=runif(100)*100)
lapply(df1$start_time, function(s) sum(df2$time>s & df2$time<(s+15)))

The best I've got so far with dplyr is the following (but this loses the identity variables):

df1 %>% 
    rowwise() %>%
    do(count = filter(df2, time>.$start_time, time < (.$start_time + 15))) %>%
    mutate(n=nrow(count))

output:

Source: local data frame [3 x 2]
Groups: <by row>

# A tibble: 3 × 2
                  count     n
                 <list> <int>
1 <data.frame [17 × 1]>    17
2 <data.frame [18 × 1]>    18
3 <data.frame [10 × 1]>    10

I was expecting to be able to do this:

df1 <- data.frame(id=LETTERS[1:3], start_time=1:3*10)
df2 <- data.frame(time=runif(100)*100)
df1 %>% 
    group_by(id) %>% 
    mutate(count = nrow(filter(df2, time>start_time, time<(start_time+15))))

but this returns the error:

Error: comparison (6) is possible only for atomic and list types

What is the dplyr way of doing this?

like image 334
James Owers Avatar asked Sep 25 '16 16:09

James Owers


People also ask

What does dplyr mutate do?

mutate() adds new variables and preserves existing ones; transmute() adds new variables and drops existing ones. New variables overwrite existing variables of the same name.

How does Group_by work in R?

The group_by() function in R is from dplyr package that is used to group rows by column values in the DataFrame, It is similar to GROUP BY clause in SQL. R dplyr groupby is used to collect identical data into groups on DataFrame and perform aggregate functions on the grouped data.

Which function in dplyr reorders the rows?

The dplyr function arrange() can be used to reorder (or sort) rows by one or more variables.


2 Answers

Here is one option with data.table where we can use the non-equi joins

library(data.table)#1.9.7+
setDT(df1)[, start_timeNew := start_time + 15]
setDT(df2)[df1, .(id, .N), on = .(time > start_time, time < start_timeNew), 
           by = .EACHI][, c('id', 'N'), with = FALSE]
#   id  N
#1:  A 17
#2:  B 18
#3:  C 10

which gives the same count as in the OP's base R method

sapply(df1$start_time, function(s) sum(df2$time>s & df2$time<(s+15)))
#[1] 17 18 10

If we need the 'id' variable also as output in dplyr, we can modify the OP's code

df1 %>%
    rowwise() %>% 
    do(data.frame(., count = filter(df2, time>.$start_time,
                                 time < (.$start_time + 15)))) %>% 
    group_by(id) %>% 
    summarise(n = n())
#      id     n
#  <fctr> <int>
#1      A    17
#2      B    18
#3      C    10

Or another option is map from purrr with dplyr

library(purrr)
df1 %>% 
    split(.$id) %>% 
    map_df(~mutate(., N = sum(df2$time >start_time & df2$time < start_time + 15))) %>% 
    select(-start_time)
#   id  N
#1  A 17
#2  B 18
#3  C 10
like image 135
akrun Avatar answered Sep 19 '22 03:09

akrun


Another slightly different approach using dplyr:

result <- df1 %>% group_by(id) %>% 
                  summarise(count = length(which(df2$time > start_time &
                                                 df2$time < (start_time+15))))

print(result)
### A tibble: 3 x 2
##      id count
##  <fctr> <int>
##1      A    17
##2      B    18
##3      C    10

I believe you can use length and which to count the number of occurrences for which your condition is true for each id in df1. Then, group by id and use this to summarise.


If there are possibly more that one start_time per id, then you can use the same function but rowwise and with mutate:

result <- df1 %>% rowwise() %>% 
                  mutate(count = length(which(df2$time > start_time & 
                                              df2$time < (start_time+15))))
print(result)
##Source: local data frame [3 x 3]
##Groups: <by row>
##
### A tibble: 3 x 3
##      id start_time count
##  <fctr>      <dbl> <int>
##1      A         10    17
##2      B         20    18
##3      C         30    10
like image 39
aichao Avatar answered Sep 21 '22 03:09

aichao