Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Merging two data frames in long format based on date

I have a 2 data frames, one (df1) that records the daily occurrence of different activities and another (df2) that records properties of the occurred activity during the day.

From df1 it is possible to identify the repeated occurrence of an activity as well the duration. When the day starts is specified by the Date variable.

For example:

  • id 12 the occurrence starts at day1 and ends at d7. In this case the occurrence is 7 and duration is 11.
  • for id 123 the week starts at day 5 and ends at d7; occurred in repeated order because of there are gap days at day 6 and duration is 6 and id 123 (starts at day6 ends at day 7) occurred 2 times consecutively and duration 6.

In df1 the variable Date defines the day when the record started. For example id 12 record started at day1 and so on.

I would like to identify if during the consecutive occurrence if there are records on the activity properties in df2.

For example id 12, occurred 7 times and duration is 12 there is record for Wednesday (day3 in df1) and this record corresponds to the 3 day of the consecutive occurrence. For id 123 there is no data (eg. no consecutive occurrence) but for id 10 for 6 day occurrence and duration 18 there is a record on the 6th day.

Df1:

id   day1 day2 day3 day4 day5 day6  day7   Date
 12    2    1    2    1    1    3    1     Mon
123    0    3    0    3    3    0    3     Fri
 10    0    3    3    3    3    3    3     Sat

Df2:

    id   c1 c2  Date
    12   3   3   Wednesday
   123   3   2   Fri
     10  3   1   Sat

Outcome:

 id c1 c2  Occurrence Position
 12 3   3     7          3
123 0   0     0          0
 10 3   1     2          1

Sample data: df1

structure(list(id = c(12L, 123L, 10L), day1 = c(2L, 0L, 3L), 
    day2 = c(1L, 3L, 3L), day3 = c(2L, 0L, 3L), day4 = c(1L, 
    3L, 3L), day5 = c(1L, 3L, 3L), day6 = c(3L, 0L, 3L), day7 = c(1L, 
    3L, 3L), Date = c("Monday", "Friday", "Saturday")), row.names = c(NA, 
-3L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x000002a81a571ef0>)

df2:

structure(list(id = c(12, 123, 10), c1 = c(3, 3, 3), c2 = c(3, 
2, 1), Date = structure(c(3L, 1L, 2L), .Label = c("Friday", "Saturday", 
"Wednesday"), class = "factor")), row.names = c(NA, -3L), class = "data.frame")
like image 502
Rstudent Avatar asked Nov 06 '22 08:11

Rstudent


1 Answers

A solution with dplyr (maybe not the shortest one):

# library
library(tidyverse)

# get data
df1 <- structure(list(id = c(12L, 123L, 10L), 
               day1 = c(2L, 0L, 3L), 
               day2 = c(1L, 3L, 3L), 
               day3 = c(2L, 0L, 3L), 
               day4 = c(1L,3L, 3L), 
               day5 = c(1L, 3L, 3L), 
               day6 = c(3L, 0L, 3L), 
               day7 = c(1L,3L, 3L), 
               Date = c("Monday", "Friday", "Saturday")),
               row.names = c(NA,-3L), class = c("data.table", "data.frame"))


df2 <- structure(list(id = c(12, 123, 10),
                      c1 = c(3, 3, 3), 
                      c2 = c(3, 2, 1),
                      Date = structure(c(3L, 1L, 2L), .Label = c("Friday", "Saturday","Wednesday"),
                                       class = "factor")), row.names = c(NA, -3L), class = "data.frame")


# change days to nummeric (will help you later)
df1 %>% mutate(
  Date_nr_df1=case_when(
    Date=="Monday" ~ 1,
    Date=="Tuesday" ~2,
    Date=="Wednesday" ~3,
    Date=="Thursday" ~4,
    Date=="Friday" ~5,
    Date=="Saturday" ~6,
    Date=="Sunday" ~7)) -> df1

df2 %>% mutate(
  Date_nr_df2=case_when(
    Date=="Monday" ~ 1,
    Date=="Tuesday" ~2,
    Date=="Wednesday" ~3,
    Date=="Thursday" ~4,
    Date=="Friday" ~5,
    Date=="Saturday" ~6,
    Date=="Sunday" ~7)) -> df2

# combine data by the id column
left_join(df1,df2, by=c("id")) -> df     

# adjust data   
df %>%
  group_by(id) %>% # to make changes per row
  mutate(days=paste0(day1,day2,day3,day4,day5,day6,day7)) %>% #pastes the values together
  mutate(days_correct=substring(days,Date_nr_df1)) %>% # applies the start day
  mutate(Occurrence_seq=str_split(days_correct, fixed("0"))[[1]][1]) %>% # extracts all days before 0
  mutate(Occurrence=nchar(Occurrence_seq)) %>%  ## counts these days
  mutate(Occurrence=case_when(Occurrence==1 ~ 0, TRUE ~ as.numeric(Occurrence))) %>% # sets Occurrence to 0 if there is no consecutive occurrence
  mutate(Position=Date_nr_df2-Date_nr_df1+1) %>% ## calculates the position you wanted
  mutate(c1=case_when(Occurrence==0 ~0, TRUE ~ c1),
         c2=case_when(Occurrence==0 ~0, TRUE ~c1),
         Position=case_when(Occurrence==0 ~ 0, TRUE ~ as.numeric(Position))) %>% 
  ungroup() %>% ungroups the df
  select(id,c1,c2,Occurrence,Position) # selects the wanted variables
#> # A tibble: 3 x 5
#>      id    c1    c2 Occurrence Position
#>   <dbl> <dbl> <dbl>      <dbl>    <dbl>
#> 1    12     3     3          7        3
#> 2   123     0     0          0        0
#> 3    10     3     3          2        1

Created on 2020-04-10 by the reprex package (v0.2.1)

like image 57
ava Avatar answered Nov 12 '22 17:11

ava