Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: Conditionally extract data from one dataframe to another

I have two dataframes and I want to conditionally extract data from one column of one dataframe and put it into a new columnn of another datafrmae.

dataframe 1 looks like this:

df1 <- data.frame(date.start = c("2019-06-10 11:52:00",
  "2019-06-11 11:52:00", "2019-06-12 11:51:00"), date.end =
  c("2019-06-10 11:53:00", "2019-06-11 11:53:00", "2019-06-12 11:53:00"))

dataframe 2 looks like this:

df2 <- data.frame(date.start = c("2019-06-11 11:50:00",
  "2019-06-10 11:51:00", "2019-06-12 11:50:00"), date.end =
  c("2019-06-11 11:54:00", "2019-06-11 08:59:00", "2019-06-12 11:57:00"),
  day = c(1, 15, 64))

If the date.start and date.end of df.1 fall within the date.start or date.end of any row of df2 I want to extract the variable day from df2 and put it in to the matching row of df1.

The expected outcome looks like this:

expected.out <- data.frame(date.start = c("2019-06-10 11:52:00", "2019-06-11 11:52:00", "2019-06-12 11:51:00"),
                           date.end = c("2019-06-10 11:53:00", "2019-06-11 11:53:00", "2019-06-12 11:53:00"),
                           day = c(15, 1, 64))

I currently have the following loop that works, but it is very slow when I run it on my large dataframe (rows = 1135133), and I am wondering if there is a faster way of doing this.

for(i in 1:nrow(df1)){
  find.match <- which(df1$date.start[i] >= df2$date.start &
                        df1$date.end[i] <= df2$date.end)
  if(length(find.match) !=0){
    df1$day[i] <- df2$day[find.match]
  }
  
}
like image 470
alex Avatar asked Mar 02 '23 01:03

alex


1 Answers

use library(fuzzyjoin)

library(tidyverse)
library(lubridate)
library(fuzzyjoin)

df1 <- data.frame(
  date.start = c("2019-06-10 11:52:00", "2019-06-11 11:52:00", "2019-06-12 11:51:00"),
  date.end = c("2019-06-10 11:53:00", "2019-06-11 11:53:00", "2019-06-12 11:53:00"), stringsAsFactors = F)

df2 <- data.frame(date.start = c("2019-06-11 11:50:00", "2019-06-10 11:51:00", "2019-06-12 11:50:00"),
                  date.end = c("2019-06-11 11:54:00", "2019-06-11 08:59:00", "2019-06-12 11:57:00"),
                  day = c(1, 15, 64), stringsAsFactors = F)

df1 <- df1 %>% 
  mutate(across(where(is.character), ymd_hms)) %>% 
  as_tibble()

df2 <- df2 %>% 
  mutate(across(where(is.character), ymd_hms)) %>% 
  as_tibble()


fuzzy_left_join(df1, df2, by = c("date.start", "date.end"), match_fun = list(`>=`, `<=`))
# A tibble: 3 x 5
  date.start.x        date.end.x          date.start.y        date.end.y            day
  <dttm>              <dttm>              <dttm>              <dttm>              <dbl>
1 2019-06-10 11:52:00 2019-06-10 11:53:00 2019-06-10 11:51:00 2019-06-11 08:59:00    15
2 2019-06-11 11:52:00 2019-06-11 11:53:00 2019-06-11 11:50:00 2019-06-11 11:54:00     1
3 2019-06-12 11:51:00 2019-06-12 11:53:00 2019-06-12 11:50:00 2019-06-12 11:57:00    64

Created on 2020-09-23 by the reprex package (v0.3.0)

not sure if the method is fast

like image 194
Yuriy Saraykin Avatar answered Mar 05 '23 00:03

Yuriy Saraykin