dplyr - right join after group_by not producing desired/expected result

Tags: r, dplyr

I am trying to get each id/year/month combination to have rows for all seven weekdays, with NA for the missing weekdays.

Here is the data frame and my attempt at achieving this task:

> df
  id year month weekday  amount
1  1 2015     1  Friday 3650.43
2  2 2015     1  Monday 1271.12
3  1 2015     2  Friday 1315.79
4  2 2015     2  Monday 2195.37
> wday
    weekday
1    Friday
2  Saturday
3 Wednesday
4    Sunday
5   Tuesday
6    Monday
7  Thursday

I tried group_by() followed by a right join, but it does not produce what I expected. Is there a simple way to achieve the result I am after?

> df <- df %>% group_by(id, year, month) %>% right_join(wday)
Joining by: "weekday"
> df
Source: local data frame [9 x 5]
Groups: id, year, month [?]

     id  year month   weekday  amount
  (dbl) (int) (int)     (chr)   (dbl)
1     1  2015     1    Friday 3650.43
2     1  2015     2    Friday 1315.79
3    NA    NA    NA  Saturday      NA
4    NA    NA    NA Wednesday      NA
5    NA    NA    NA    Sunday      NA
6    NA    NA    NA   Tuesday      NA
7     2  2015     1    Monday 1271.12
8     2  2015     2    Monday 2195.37
9    NA    NA    NA  Thursday      NA

I want 7 rows per id/year/month combination where amount for missing weekdays will be NA (or zeroes ideally, but I know how to get that by mutate()).

Resulting data frame should look like this:

> df
   id year month   weekday  amount
1   1 2015     1    Friday 3650.43
2   1 2015     1    Monday    0.00
3   1 2015     1  Saturday    0.00
4   1 2015     1    Sunday    0.00
5   1 2015     1  Thursday    0.00
6   1 2015     1   Tuesday    0.00
7   1 2015     1 Wednesday    0.00
8   1 2015     2    Friday 1315.79
9   1 2015     2    Monday    0.00
10  1 2015     2  Saturday    0.00
11  1 2015     2    Sunday    0.00
12  1 2015     2  Thursday    0.00
13  1 2015     2   Tuesday    0.00
14  1 2015     2 Wednesday    0.00
15  2 2015     1    Friday    0.00
16  2 2015     1    Monday 1271.12
17  2 2015     1  Saturday    0.00
18  2 2015     1    Sunday    0.00
19  2 2015     1  Thursday    0.00
20  2 2015     1   Tuesday    0.00
21  2 2015     1 Wednesday    0.00
22  2 2015     2    Friday    0.00
23  2 2015     2    Monday 2195.37
24  2 2015     2  Saturday    0.00
25  2 2015     2    Sunday    0.00
26  2 2015     2  Thursday    0.00
27  2 2015     2   Tuesday    0.00
28  2 2015     2 Wednesday    0.00
Gopala asked Dec 20 '15 15:12


3 Answers

Using tidyr and dplyr. complete here does the heavy lifting - if you already have each weekday somewhere in df, you won't need the bind_rows or na.omit (or dplyr).

library(dplyr)
library(tidyr)
df %>% #initial data
    bind_rows(wday) %>% #adding on so we have all the weekdays
    complete(id, year, month, weekday,  #completing all levels of id:year:month:weekday
                fill = list(amount = 0)) %>% #filling amount column with 0
    na.omit() #remove the NAs we got from the bind_rows
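
A possibly shorter variant, sketched under the assumption of a reasonably recent tidyr: complete() can take the weekday values directly as a named argument, so the bind_rows/na.omit round trip is not needed. The join in the question matched on weekday alone, because dplyr joins are not performed per group; nesting() is what restricts the expansion to the id/year/month combinations that actually occur. The data frames below are reconstructed from the question's printout, so the column types are an assumption.

```r
library(tidyr)

# Data reconstructed from the question (column types assumed)
df <- data.frame(
  id      = c(1, 2, 1, 2),
  year    = c(2015L, 2015L, 2015L, 2015L),
  month   = c(1L, 1L, 2L, 2L),
  weekday = c("Friday", "Monday", "Friday", "Monday"),
  amount  = c(3650.43, 1271.12, 1315.79, 2195.37),
  stringsAsFactors = FALSE
)
wday <- data.frame(
  weekday = c("Friday", "Saturday", "Wednesday", "Sunday",
              "Tuesday", "Monday", "Thursday"),
  stringsAsFactors = FALSE
)

# nesting() keeps only id/year/month combinations present in df;
# each one is crossed with all seven weekdays taken from wday,
# and the amount for newly created rows is filled with 0.
res <- complete(df, nesting(id, year, month),
                weekday = wday$weekday,
                fill = list(amount = 0))
```

If every combination of id, year and month should appear even when it is absent from df, dropping nesting() and listing the columns separately (as in the code above this sketch) gives the full cross product instead.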
jeremycg answered Nov 20 '22 15:11


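We can use expand.grid to build every id/year/month/weekday combination and then join the original data onto it:

library(dplyr)
# all combinations of the unique id, year, month values and the seven weekdays
expand.grid(c(lapply(df[1:3], unique), wday['weekday'])) %>% 
       left_join(., df) %>%
       # weekdays with no match in df get amount = 0 instead of NA
       mutate(amount=replace(amount, is.na(amount), 0)) %>% 
       arrange(id, year, month, weekday)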
#    id year month   weekday  amount
#1   1 2015     1    Friday 3650.43
#2   1 2015     1    Monday    0.00
#3   1 2015     1  Saturday    0.00
#4   1 2015     1    Sunday    0.00
#5   1 2015     1  Thursday    0.00
#6   1 2015     1   Tuesday    0.00
#7   1 2015     1 Wednesday    0.00
#8   1 2015     2    Friday 1315.79
#9   1 2015     2    Monday    0.00
#10  1 2015     2  Saturday    0.00
#11  1 2015     2    Sunday    0.00
#12  1 2015     2  Thursday    0.00
#13  1 2015     2   Tuesday    0.00
#14  1 2015     2 Wednesday    0.00
#15  2 2015     1    Friday    0.00
#16  2 2015     1    Monday 1271.12
#17  2 2015     1  Saturday    0.00
#18  2 2015     1    Sunday    0.00
#19  2 2015     1  Thursday    0.00
#20  2 2015     1   Tuesday    0.00
#21  2 2015     1 Wednesday    0.00
#22  2 2015     2    Friday    0.00
#23  2 2015     2    Monday 2195.37
#24  2 2015     2  Saturday    0.00
#25  2 2015     2    Sunday    0.00
#26  2 2015     2  Thursday    0.00
#27  2 2015     2   Tuesday    0.00
#28  2 2015     2 Wednesday    0.00
akrun answered Nov 20 '22 16:11


sqldf: For complex joins it is usually easier to use SQL. The join below has no ON clause, so it pairs every row of df with every row of wday:

library(sqldf)
sqldf("select 
         id, 
         year, 
         month, 
         wday.weekday, 
         sum((df.weekday = wday.weekday) * amount) amount 
       from df 
       join wday
       group by 1, 2, 3, 4")

giving:

   id year month   weekday  amount
1   1 2015     1    Friday 3650.43
2   1 2015     1  Saturday    0.00
3   1 2015     1 Wednesday    0.00
4   1 2015     1    Sunday    0.00
5   1 2015     1   Tuesday    0.00
6   1 2015     1    Monday    0.00
7   1 2015     1  Thursday    0.00
8   2 2015     1    Friday    0.00
9   2 2015     1  Saturday    0.00
10  2 2015     1 Wednesday    0.00
11  2 2015     1    Sunday    0.00
12  2 2015     1   Tuesday    0.00
13  2 2015     1    Monday 1271.12
14  2 2015     1  Thursday    0.00
15  1 2015     2    Friday 1315.79
16  1 2015     2  Saturday    0.00
17  1 2015     2 Wednesday    0.00
18  1 2015     2    Sunday    0.00
19  1 2015     2   Tuesday    0.00
20  1 2015     2    Monday    0.00
21  1 2015     2  Thursday    0.00
22  2 2015     2    Friday    0.00
23  2 2015     2  Saturday    0.00
24  2 2015     2 Wednesday    0.00
25  2 2015     2    Sunday    0.00
26  2 2015     2   Tuesday    0.00
27  2 2015     2    Monday 2195.37
28  2 2015     2  Thursday    0.00

base R: We could replicate this in base R using merge and transform:

xt <- transform(
  merge(df, wday, by = c()),  # by = c() gives the cross product of df and wday
  amount = (as.character(weekday.x) == as.character(weekday.y)) * amount, 
  weekday = weekday.y, 
  weekday.x = NULL, 
  weekday.y = NULL
)
aggregate(amount ~., xt, sum)

dplyr: If we really wanted to use dplyr, we could replace the transform with mutate, rename and select:

library(dplyr)
merge(df, wday, by = c()) %>% 
 mutate(amount = (as.character(weekday.x) == as.character(weekday.y)) * amount) %>%
 rename(weekday = weekday.y) %>%
 select(-weekday.x) %>%
 group_by(id, year, month, weekday) %>%
 summarise(amount = sum(amount))

Note: If there is only one weekday per group (as in the question) we could optionally omit group by/sum, aggregate and group_by/summarise in the three solutions respectively.
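
As an illustration of that note, here is a minimal sketch of the dplyr variant without group_by/summarise. It assumes dplyr 1.1.0 or later, where cross_join() can stand in for merge(df, wday, by = c()), and it relies on there being only one weekday per id/year/month group. The data frames are reconstructed from the question's printout, so the column types are an assumption.

```r
library(dplyr)

# Data reconstructed from the question (column types assumed)
df <- data.frame(
  id      = c(1, 2, 1, 2),
  year    = c(2015L, 2015L, 2015L, 2015L),
  month   = c(1L, 1L, 2L, 2L),
  weekday = c("Friday", "Monday", "Friday", "Monday"),
  amount  = c(3650.43, 1271.12, 1315.79, 2195.37),
  stringsAsFactors = FALSE
)
wday <- data.frame(
  weekday = c("Friday", "Saturday", "Wednesday", "Sunday",
              "Tuesday", "Monday", "Thursday"),
  stringsAsFactors = FALSE
)

# Pair every row of df with every weekday; the clashing weekday
# columns are disambiguated as weekday.x (from df) and weekday.y (from wday).
res <- cross_join(df, wday, suffix = c(".x", ".y")) %>%
  # keep the amount only where the original weekday matches, else 0
  mutate(amount = (weekday.x == weekday.y) * amount) %>%
  select(-weekday.x) %>%
  rename(weekday = weekday.y)
```

With more than one weekday per group in df, this would produce duplicate id/year/month/weekday rows, and the group_by/summarise step would be needed again.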

G. Grothendieck answered Nov 20 '22 14:11