I am trying to perform a comparison between items in subsequent groups in a dataframe - I guess this is pretty easy when you know what you are doing...
My data set can be represented as follows:
set.seed(1)
data <- data.frame(
date = c(rep('2015-02-01',15), rep('2015-02-02',16), rep('2015-02-03',15)),
id = as.character(c(1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,16,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE)))
)
Which yields a dataframe that looks like:
date id
1/02/2015 1008
1/02/2015 1009
1/02/2015 1011
1/02/2015 1015
1/02/2015 1008
1/02/2015 1014
1/02/2015 1015
1/02/2015 1012
1/02/2015 1012
1/02/2015 1006
1/02/2015 1008
1/02/2015 1007
1/02/2015 1012
1/02/2015 1009
1/02/2015 1013
2/02/2015 1010
2/02/2015 1013
2/02/2015 1015
2/02/2015 1009
2/02/2015 1013
2/02/2015 1015
2/02/2015 1008
2/02/2015 1012
2/02/2015 1007
2/02/2015 1008
2/02/2015 1009
2/02/2015 1006
2/02/2015 1009
2/02/2015 1014
2/02/2015 1009
2/02/2015 1010
3/02/2015 1011
3/02/2015 1010
3/02/2015 1007
3/02/2015 1014
3/02/2015 1012
3/02/2015 1013
3/02/2015 1007
3/02/2015 1013
3/02/2015 1010
Then I want to group the data by date (group_by) and then filter out duplicates (distinct) before comparing between the groups. What I want to do is determine from day to day which new id's are added and which id's leave. So day 1 and day 2 would be compared to determine the id's in day 2 that were not in day 1 and the id's that were in day 1 but not present in day 2, then do the same comparisons between day 2 and day 3 etc.
The comparison can be done very easily using an anti_join (dplyr) but I don't know how to reference individual groups in the dataset.
My attempt (or one of my attempts) looks like:
data %>%
group_by(date) %>%
distinct(id) %>%
do(lost = anti_join(., lag(.), by="id"))
But of course this does not work, I just get:
Error in anti_join_impl(x, y, by$x, by$y) : Can't join on 'id' x 'id' because of incompatible types (factor / logical)
Is what I am attempting to do even possible or should I be looking at writing a clunky function to do it?
My understanding from the question is that data shows the id's at each date and we want to iterate through the dates comparing the ids in that date to the ids in the immediately prior date.
First get the unique rows u
and convert the id
to numeric. Then split id
by date
giving s
and define a function diffs
which produces a numeric vector of added id's using negative numbers for removed id's. lapply
that to seq_along(s) except for the first component as it has no prior component. No packages are used.
u <- unique(data)
u$id <- as.numeric(as.character(u$id))
s <- split(u$id, u$date)
diffs <- function(i) c(setdiff(s[[i]], s[[i-1]]), - setdiff(s[[i-1]], s[[i]]))
diffs_list <- setNames(lapply(seq_along(s)[-1], diffs), names(s)[-1])
giving:
> diffs_list
$`2015-02-02`
[1] 1010 -1011
$`2015-02-03`
[1] 1011 -1015 -1009 -1008 -1006
or if you want a data frame as output
setNames(stack(diffs_list), c("id", "date"))
giving:
id date
1 1010 2015-02-02
2 -1011 2015-02-02
3 1011 2015-02-03
4 -1015 2015-02-03
5 -1009 2015-02-03
6 -1008 2015-02-03
7 -1006 2015-02-03
magrittr
This could also be expressed using the magrittr package like this where diffs
is defined above.
library(magrittr)
data %>%
unique %>%
transform(id = as.numeric(as.character(id))) %>%
{ split(.$id, .$date) } %>%
{ setNames(lapply(seq_along(.)[-1], diffs), names(.)[-1]) }
Note: I have replaced -3 in data$date
with -03.
I'm sure I don't get to vote for my own answer but I must say that I like mine the best. I was hoping to get an answer that used the dplyr tools to solve the problem so I kept researching and I think I now have a (semi) elegant solution (apart from the for loop in my function).
Generating the sample data set the same way but with more data to make it more interesting:
set.seed(1)
data <- data.frame(
date = c(rep('2015-02-01',15), rep('2015-02-02',16), rep('2015-02-03',15), rep('2015-02-04',15), rep('2015-02-05',15)),
id = as.character(c(1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,16,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE)))
)
Searching through the interweb I found the dplyr function 'nest()' which looked to solve all my grouping issues. The nest() function takes the groups created by group_by() and rolls them into a list of data frames so you end up with one entry for each variable you have grouped on and then a data frame for all of the remaining variables that fit into that group - here it is:
dataNested <- data %>%
group_by(date) %>%
distinct(id) %>%
nest()
Which yields a fairly strange dataframe that looks like:
date data
1 2015-02-01 list(id = c(3, 4, 6, 10, 9, 7, 1, 2, 8))
2 2015-02-02 list(id = c(5, 8, 10, 4, 3, 7, 2, 1, 9))
3 2015-02-03 list(id = c(6, 5, 2, 9, 7, 8))
4 2015-02-04 list(id = c(1, 5, 8, 7, 9, 3, 4, 6, 10))
5 2015-02-05 list(id = c(3, 5, 4, 7, 8, 1, 9))
Whereby the indexes in the lists reference a list of the id's (strange but true).
This now allows us to reference the groups by index number viz:
dataNested$data[[2]]
returns:
# A tibble: 9 × 1
id
<fctr>
1 1010
2 1013
3 1015
4 1009
5 1008
6 1012
7 1007
8 1006
From here it's a simple matter of writing a function that will do the anti_join to leave us with just the differences between each subsequent group (though this is the part I'm not proud of and really starts to show my lack of R skills - please feel free to suggest improvements):
## Function departed() - returns the id's that were dropped from each subsequent time period
departed <- function(groups) {
tempList <- vector("list", nrow(groups))
# Loop through the groups and do an anti_join between each
for (i in seq(1, nrow(groups) - 1)) {
tempList[[i + 1]] <-
anti_join(data.frame(groups$data[[i]]), data.frame(groups$data[[i + 1]]), by = "id")
}
return(tempList)
}
Applying this function to our nested data yields the list of lists of departed id's:
> departedIDs <- dataNested %>% departed()
> departedIDs
[[1]]
NULL
[[2]]
id
1 1011
[[3]]
id
1 1006
2 1008
3 1009
4 1015
[[4]]
id
1 1007
[[5]]
id
1 1011
2 1015
I hope this answer will help others who's brain works the same way as mine.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With