dplyr mutate calling another dataframe

Tags:

r

dplyr

I would like to mutate a dataframe by applying a function which calls out to another dataframe. I can achieve this in a few different ways, but would like to know how to do this 'properly'.

Here is an example of what I'm trying to do. I have a dataframe with some start times, and a second with some timed observations. I would like to return a dataframe with the start times, and the number of observations that occur within some window after the start time. e.g.

set.seed(1337)
df1 <- data.frame(id=LETTERS[1:3], start_time=1:3*10)
df2 <- data.frame(time=runif(100)*100)
lapply(df1$start_time, function(s) sum(df2$time>s & df2$time<(s+15)))

The best I've got so far with dplyr is the following (but this loses the identity variables):

df1 %>% 
    rowwise() %>%
    do(count = filter(df2, time>.$start_time, time < (.$start_time + 15))) %>%
    mutate(n=nrow(count))

output:

Source: local data frame [3 x 2]
Groups: <by row>

# A tibble: 3 × 2
                  count     n
                 <list> <int>
1 <data.frame [17 × 1]>    17
2 <data.frame [18 × 1]>    18
3 <data.frame [10 × 1]>    10

I was expecting to be able to do this:

df1 <- data.frame(id=LETTERS[1:3], start_time=1:3*10)
df2 <- data.frame(time=runif(100)*100)
df1 %>% 
    group_by(id) %>% 
    mutate(count = nrow(filter(df2, time>start_time, time<(start_time+15))))

but this returns the error:

Error: comparison (6) is possible only for atomic and list types

What is the dplyr way of doing this?


asked Sep 25 '16 16:09

James Owers

2 Answers

Here is one option with data.table, where we can use a non-equi join:

library(data.table)  # requires 1.9.7+
setDT(df1)[, start_timeNew := start_time + 15]
setDT(df2)[df1, .(id, .N), on = .(time > start_time, time < start_timeNew), 
           by = .EACHI][, c('id', 'N'), with = FALSE]
#   id  N
#1:  A 17
#2:  B 18
#3:  C 10

which gives the same count as in the OP's base R method

sapply(df1$start_time, function(s) sum(df2$time>s & df2$time<(s+15)))
#[1] 17 18 10
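
To keep the id column next to the counts in base R, the same window test can be written with vapply and assigned back to df1. This is a minimal sketch rather than part of the original answer: the hand-picked df2 times and the `window` variable are illustrative assumptions, chosen so the counts can be checked by eye.

```r
# Illustrative data (assumed; not the OP's random sample)
df1 <- data.frame(id = LETTERS[1:3], start_time = 1:3 * 10)
df2 <- data.frame(time = c(5, 11, 12, 24, 26, 31, 33, 40, 44, 50))
window <- 15  # hypothetical name for the window length

# For each start time, count observations strictly inside (start, start + window)
df1$n <- vapply(
  df1$start_time,
  function(s) sum(df2$time > s & df2$time < s + window),
  integer(1)  # sum() of a logical vector is an integer
)
df1
```

With these times the counts are 3, 4 and 4 for A, B and C. vapply is used instead of sapply only so that the return type is fixed to a single integer per start time.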

If we also need the 'id' variable in the dplyr output, we can modify the OP's code:

df1 %>%
    rowwise() %>% 
    do(data.frame(., count = filter(df2, time>.$start_time,
                                 time < (.$start_time + 15)))) %>% 
    group_by(id) %>% 
    summarise(n = n())
#      id     n
#  <fctr> <int>
#1      A    17
#2      B    18
#3      C    10

Another option is map_df from purrr, combined with dplyr:

library(purrr)
df1 %>% 
    split(.$id) %>% 
    map_df(~mutate(., N = sum(df2$time >start_time & df2$time < start_time + 15))) %>% 
    select(-start_time)
#   id  N
#1  A 17
#2  B 18
#3  C 10
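
A variant of the purrr idea, offered as a sketch and not as part of the original answer, is to call map_int inside mutate so that no split is needed and all columns of df1 are kept. The df2 times below are illustrative assumptions chosen so the result is easy to verify.

```r
library(dplyr)
library(purrr)

# Illustrative data (assumed; not the OP's random sample)
df1 <- data.frame(id = LETTERS[1:3], start_time = 1:3 * 10)
df2 <- data.frame(time = c(5, 11, 12, 24, 26, 31, 33, 40, 44, 50))

# map_int() runs the window count once per start_time and returns an integer vector
res <- df1 %>%
  mutate(N = map_int(start_time, ~ sum(df2$time > .x & df2$time < .x + 15)))
res
```

Here N comes out as 3, 4 and 4. Because map_int iterates over start_time element by element, this also works when an id appears more than once.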


answered Sep 19 '22 03:09

akrun

Another slightly different approach using dplyr:

result <- df1 %>% group_by(id) %>% 
                  summarise(count = length(which(df2$time > start_time &
                                                 df2$time < (start_time+15))))

print(result)
### A tibble: 3 x 2
##      id count
##  <fctr> <int>
##1      A    17
##2      B    18
##3      C    10

I believe you can use length and which to count the rows of df2 for which your condition is true. Grouping by id and calling summarise then gives that count for each id in df1.
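
As a small base R illustration of how length(which(...)) compares with sum(...) for counting TRUE values (the `cond` vector is made up for this sketch): the two agree when there are no missing values, as in the OP's df2, but differ when the condition contains NA.

```r
# Hypothetical logical vector with one missing value
cond <- c(TRUE, NA, FALSE, TRUE)

n_which <- length(which(cond))     # which() drops NA, so only the TRUEs are counted
n_sum   <- sum(cond)               # NA propagates through sum()
n_sum_rm <- sum(cond, na.rm = TRUE) # matches length(which(cond))
```

So length(which(...)) is the safer choice if df2$time could contain NA, while sum(...) is shorter when it cannot.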

If there may be more than one start_time per id, then you can use the same function, but with rowwise and mutate:

result <- df1 %>% rowwise() %>% 
                  mutate(count = length(which(df2$time > start_time & 
                                              df2$time < (start_time+15))))
print(result)
##Source: local data frame [3 x 3]
##Groups: <by row>
##
### A tibble: 3 x 3
##      id start_time count
##  <fctr>      <dbl> <int>
##1      A         10    17
##2      B         20    18
##3      C         30    10
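
To make the repeated-id case concrete, here is a minimal sketch with an id that has two start times. The data are illustrative assumptions (not the OP's), and the trailing ungroup() is an addition that simply drops the row-wise grouping afterwards.

```r
library(dplyr)

# Illustrative data (assumed): id "A" has two start times
df1 <- data.frame(id = c("A", "A", "B"), start_time = c(10, 30, 20))
df2 <- data.frame(time = c(5, 11, 12, 24, 26, 31, 33, 40, 44, 50))

# rowwise() makes start_time a single value inside mutate(),
# so each row of df1 gets its own window count
res <- df1 %>%
  rowwise() %>%
  mutate(count = length(which(df2$time > start_time &
                              df2$time < (start_time + 15)))) %>%
  ungroup()
res
```

Each of the three rows keeps its own count (3, 4 and 4 here), whereas grouping by id and summarising would try to collapse both "A" rows into one.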

answered Sep 21 '22 03:09

aichao
