Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Comparing between groups in grouped dataframe

I am trying to perform a comparison between items in subsequent groups in a dataframe - I guess this is pretty easy when you know what you are doing...

My data set can be represented as follows:

set.seed(1)
data <- data.frame(
 date = c(rep('2015-02-01',15), rep('2015-02-02',16), rep('2015-02-03',15)),
 id = as.character(c(1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,16,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE)))
)

Which yields a dataframe that looks like:

date    id
1/02/2015   1008
1/02/2015   1009
1/02/2015   1011
1/02/2015   1015
1/02/2015   1008
1/02/2015   1014
1/02/2015   1015
1/02/2015   1012
1/02/2015   1012
1/02/2015   1006
1/02/2015   1008
1/02/2015   1007
1/02/2015   1012
1/02/2015   1009
1/02/2015   1013
2/02/2015   1010
2/02/2015   1013
2/02/2015   1015
2/02/2015   1009
2/02/2015   1013
2/02/2015   1015
2/02/2015   1008
2/02/2015   1012
2/02/2015   1007
2/02/2015   1008
2/02/2015   1009
2/02/2015   1006
2/02/2015   1009
2/02/2015   1014
2/02/2015   1009
2/02/2015   1010
3/02/2015   1011
3/02/2015   1010
3/02/2015   1007
3/02/2015   1014
3/02/2015   1012
3/02/2015   1013
3/02/2015   1007
3/02/2015   1013
3/02/2015   1010

Then I want to group the data by date (group_by) and then filter out duplicates (distinct) before comparing between the groups. What I want to do is determine from day to day which new id's are added and which id's leave. So day 1 and day 2 would be compared to determine the id's in day 2 that were not in day 1 and the id's that were in day 1 but not present in day 2, then do the same comparisons between day 2 and day 3 etc.
The comparison can be done very easily using an anti_join (dplyr) but I don't know how to reference individual groups in the dataset.

My attempt (or one of my attempts) looks like:

data %>%
  group_by(date) %>%
  distinct(id) %>%
  do(lost = anti_join(., lag(.), by="id"))

But of course this does not work, I just get:

Error in anti_join_impl(x, y, by$x, by$y) : Can't join on 'id' x 'id' because of incompatible types (factor / logical)

Is what I am attempting to do even possible or should I be looking at writing a clunky function to do it?

like image 989
CrustyNoodle Avatar asked Aug 20 '17 06:08

CrustyNoodle


2 Answers

My understanding from the question is that data shows the id's at each date and we want to iterate through the dates comparing the ids in that date to the ids in the immediately prior date.

First get the unique rows u and convert the id to numeric. Then split id by date giving s and define a function diffs which produces a numeric vector of added id's using negative numbers for removed id's. lapply that to seq_along(s) except for the first component as it has no prior component. No packages are used.

u <- unique(data)
u$id <- as.numeric(as.character(u$id))
s <- split(u$id, u$date)
diffs <- function(i) c(setdiff(s[[i]], s[[i-1]]), - setdiff(s[[i-1]], s[[i]]))
diffs_list <- setNames(lapply(seq_along(s)[-1], diffs), names(s)[-1])

giving:

> diffs_list
$`2015-02-02`
[1]  1010 -1011

$`2015-02-03`
[1]  1011 -1015 -1009 -1008 -1006

or if you want a data frame as output

setNames(stack(diffs_list), c("id", "date"))

giving:

     id       date
1  1010 2015-02-02
2 -1011 2015-02-02
3  1011 2015-02-03
4 -1015 2015-02-03
5 -1009 2015-02-03
6 -1008 2015-02-03
7 -1006 2015-02-03

magrittr

This could also be expressed using the magrittr package like this where diffs is defined above.

library(magrittr)

data %>%
     unique %>%
     transform(id = as.numeric(as.character(id))) %>%
     { split(.$id, .$date) } %>%
     { setNames(lapply(seq_along(.)[-1], diffs), names(.)[-1]) }

Note: I have replaced -3 in data$date with -03.

like image 143
G. Grothendieck Avatar answered Oct 22 '22 10:10

G. Grothendieck


I'm sure I don't get to vote for my own answer but I must say that I like mine the best. I was hoping to get an answer that used the dplyr tools to solve the problem so I kept researching and I think I now have a (semi) elegant solution (apart from the for loop in my function).

Generating the sample data set the same way but with more data to make it more interesting:

set.seed(1)
data <- data.frame(
  date = c(rep('2015-02-01',15), rep('2015-02-02',16), rep('2015-02-03',15), rep('2015-02-04',15), rep('2015-02-05',15)),
  id = as.character(c(1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,16,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE)))
)

Searching through the interweb I found the dplyr function 'nest()' which looked to solve all my grouping issues. The nest() function takes the groups created by group_by() and rolls them into a list of data frames so you end up with one entry for each variable you have grouped on and then a data frame for all of the remaining variables that fit into that group - here it is:

dataNested <- data %>%
  group_by(date) %>%
  distinct(id) %>%
  nest()

Which yields a fairly strange dataframe that looks like:

     date          data
1    2015-02-01    list(id = c(3, 4, 6, 10, 9, 7, 1, 2, 8))
2    2015-02-02    list(id = c(5, 8, 10, 4, 3, 7, 2, 1, 9))
3    2015-02-03    list(id = c(6, 5, 2, 9, 7, 8))
4    2015-02-04    list(id = c(1, 5, 8, 7, 9, 3, 4, 6, 10))
5    2015-02-05    list(id = c(3, 5, 4, 7, 8, 1, 9))

Whereby the indexes in the lists reference a list of the id's (strange but true).

This now allows us to reference the groups by index number viz:

dataNested$data[[2]]

returns:

# A tibble: 9 × 1
      id
  <fctr>
1   1010
2   1013
3   1015
4   1009
5   1008
6   1012
7   1007
8   1006

From here it's a simple matter of writing a function that will do the anti_join to leave us with just the differences between each subsequent group (though this is the part I'm not proud of and really starts to show my lack of R skills - please feel free to suggest improvements):

## Function departed() - returns the id's that were dropped from each subsequent time period
departed <- function(groups) {
  tempList <- vector("list", nrow(groups))
  # Loop through the groups and do an anti_join between each
  for (i in seq(1, nrow(groups) - 1)) {
  tempList[[i + 1]] <-
  anti_join(data.frame(groups$data[[i]]),  data.frame(groups$data[[i + 1]]), by = "id")

  }
  return(tempList)
}

Applying this function to our nested data yields the list of lists of departed id's:

> departedIDs <- dataNested %>% departed()

> departedIDs
[[1]]
NULL

[[2]]
    id
1 1011

[[3]]
    id
1 1006
2 1008
3 1009
4 1015

[[4]]
    id
1 1007

[[5]]
    id
1 1011
2 1015

I hope this answer will help others who's brain works the same way as mine.

like image 2
CrustyNoodle Avatar answered Oct 22 '22 11:10

CrustyNoodle