I would like to compare two data sets and identify specific instances of discrepancies between them (i.e., which variables were different).
While I have found out how to identify which records are not identical between the two data sets (using the function detailed here: http://www.cookbook-r.com/Manipulating_data/Comparing_data_frames/), I'm not sure how to flag which variables are different.
E.g.
Data set A:
id name dob vaccinedate vaccinename dose
100000 John Doe 1/1/2000 5/20/2012 MMR 4
100001 Jane Doe 7/3/2011 3/14/2013 VARICELLA 1
Data set B:
id name dob vaccinedate vaccinename dose
100000 John Doe 1/1/2000 5/20/2012 MMR 3
100001 Jane Doee 7/3/2011 3/24/2013 VARICELLA 1
100002 John Smith 2/5/2010 7/13/2013 HEPB 3
I want to identify which records are different, and which specific variable(s) have discrepancies. For example, the John Doe record has 1 discrepancy in dose, and the Jane Doe record has 2 discrepancies: in name and vaccinedate. Also, data set B has one additional record that was not in data set A, and I would want to identify these instances as well.
In the end, the goal is to find the frequency of the "types" of errors, e.g. how many records have a discrepancy in vaccinedate, vaccinename, dose, etc.
Thanks!
One possibility. First, find out which ids both datasets have in common. The simplest way to do this is:
commonID<-intersect(A$id,B$id)
Then you can determine which rows are missing from A by:
> B[!B$id %in% commonID,]
# id name dob vaccinedate vaccinename dose
# 3 100002 John Smith 2/5/2010 7/13/2013 HEPB 3
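For anyone who wants to run these steps, here is a minimal sketch that rebuilds the question's two tables as A and B and checks for unmatched ids in both directions. Storing the text columns as plain character vectors (stringsAsFactors = FALSE) is my assumption, since the question does not say how the data were read in, and the names onlyInB and onlyInA are only illustrative.

```r
# Rebuild the example data from the question (assumed to be character columns).
A <- data.frame(
  id = c(100000L, 100001L),
  name = c("John Doe", "Jane Doe"),
  dob = c("1/1/2000", "7/3/2011"),
  vaccinedate = c("5/20/2012", "3/14/2013"),
  vaccinename = c("MMR", "VARICELLA"),
  dose = c(4L, 1L),
  stringsAsFactors = FALSE
)
B <- data.frame(
  id = c(100000L, 100001L, 100002L),
  name = c("John Doe", "Jane Doee", "John Smith"),
  dob = c("1/1/2000", "7/3/2011", "2/5/2010"),
  vaccinedate = c("5/20/2012", "3/24/2013", "7/13/2013"),
  vaccinename = c("MMR", "VARICELLA", "HEPB"),
  dose = c(3L, 1L, 3L),
  stringsAsFactors = FALSE
)

commonID <- intersect(A$id, B$id)

# Records that appear in only one of the two data sets.
onlyInB <- B[!B$id %in% commonID, ]
onlyInA <- A[!A$id %in% commonID, ]

onlyInB$id     # the extra John Smith record, id 100002
nrow(onlyInA)  # 0: every id in A also appears in B
```

Checking both directions costs nothing and covers the case where A also has records that B lacks.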
Next, you can restrict both datasets to the ids they have in common.
Acommon<-A[A$id %in% commonID,]
Bcommon<-B[B$id %in% commonID,]
If you can't assume that the ids are in the same order in both data sets, sort them both (this also assumes each id appears only once per data set):
Acommon<-Acommon[order(Acommon$id),]
Bcommon<-Bcommon[order(Bcommon$id),]
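Now you can see which fields differ by comparing the two data frames element-wise. (If the columns are factors with different level sets, convert them with as.character first, otherwise the comparison stops with an error.)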
diffs<-Acommon != Bcommon
diffs
# id name dob vaccinedate vaccinename dose
# 1 FALSE FALSE FALSE FALSE FALSE TRUE
# 2 FALSE TRUE FALSE TRUE FALSE FALSE
This is a logical matrix, and you can do whatever you want with it. For example, to find the total number of errors in each column:
colSums(diffs)
# id name dob vaccinedate vaccinename dose
# 0 1 0 1 0 1
To find all ids where the name is different:
Acommon$id[diffs[,"name"]]
# [1] 100001
And so on.
This should get you started, but there may be more elegant solutions.
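To get at the "which variables differ for each record" part of the question, the logical matrix can be turned into a per-record list. The sketch below is only one way to do it: cell_diffs is a hypothetical helper (not a base R function), the example data are a cut-down version of the question's tables stored as factors on purpose, and comparing values as character strings assumes each column has the same type in both data sets.

```r
# Cut-down example data (fewer columns than the question), stored as factors
# and with B's rows in a different order, to exercise the awkward cases.
A <- data.frame(id = c(100000L, 100001L),
                name = c("John Doe", "Jane Doe"),
                vaccinedate = c("5/20/2012", "3/14/2013"),
                dose = c(4L, 1L),
                stringsAsFactors = TRUE)
B <- data.frame(id = c(100001L, 100000L),
                name = c("Jane Doee", "John Doe"),
                vaccinedate = c("3/24/2013", "5/20/2012"),
                dose = c(1L, 3L),
                stringsAsFactors = TRUE)

# Hypothetical helper: cell-by-cell differences between two data frames
# that contain the same ids and the same columns.
cell_diffs <- function(x, y, key = "id") {
  x <- x[order(x[[key]]), , drop = FALSE]
  y <- y[order(y[[key]]), , drop = FALSE]
  stopifnot(identical(names(x), names(y)),
            nrow(x) == nrow(y),
            all(x[[key]] == y[[key]]))
  cols <- Map(function(a, b) {
    # Compare as character so factors with different levels do not error.
    a <- as.character(a)
    b <- as.character(b)
    # NA versus a value counts as a difference; NA versus NA does not.
    ifelse(is.na(a) | is.na(b), xor(is.na(a), is.na(b)), a != b)
  }, x, y)
  out <- do.call(cbind, cols)
  rownames(out) <- x[[key]]
  out
}

diffs <- cell_diffs(A, B)

# Names of the differing variables, one list element per record.
per_record <- lapply(seq_len(nrow(diffs)),
                     function(i) colnames(diffs)[diffs[i, ]])
names(per_record) <- rownames(diffs)

per_record
colSums(diffs)  # frequency of each type of discrepancy
```

Here the first record differs only in dose and the second in name and vaccinedate, which matches the discrepancies described in the question.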
First, establish df1 and df2 so others can reproduce quickly:
df1 <- structure(list(id = 100000:100001, name = structure(c(2L, 1L), .Label = c("Jane Doe","John Doe"), class = "factor"), dob = structure(1:2, .Label = c("1/1/2000", "7/3/2011"), class = "factor"), vaccinedate = structure(c(2L, 1L), .Label = c("3/14/2013", "5/20/2012"), class = "factor"), vaccinename = structure(1:2, .Label = c("MMR", "VARICELLA"), class = "factor"), dose = c(4L, 1L)), .Names = c("id", "name", "dob", "vaccinedate", "vaccinename", "dose"), class = "data.frame", row.names = c(NA, -2L))
df2 <- structure(list(id = 100000:100002, name = structure(c(2L, 1L, 3L), .Label = c("Jane Doee", "John Doe", "John Smith"), class = "factor"), dob = structure(c(1L, 3L, 2L), .Label = c("1/1/2000", "2/5/2010", "7/3/2011"), class = "factor"), vaccinedate = structure(c(2L, 1L, 3L), .Label = c("3/24/2013", "5/20/2012", "7/13/2013"), class = "factor"), vaccinename = structure(c(2L, 3L, 1L), .Label = c("HEPB", "MMR", "VARICELLA"), class = "factor"), dose = c(3L, 1L, 3L)), .Names = c("id", "name", "dob", "vaccinedate", "vaccinename", "dose"), class = "data.frame", row.names = c(NA, -3L))
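Next, get the discrepancies from df1 to df2 via mapply and setdiff. That is, for each column, which values occur in set one but not anywhere in set two. Note that this compares the set of values in each column; it does not match records up by id: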
discrep <- mapply(setdiff, df1, df2)
discrep
# $id
# integer(0)
#
# $name
# [1] "Jane Doe"
#
# $dob
# character(0)
#
# $vaccinedate
# [1] "3/14/2013"
#
# $vaccinename
# character(0)
#
# $dose
# [1] 4
To count them up we can use sapply:
num.discrep <- sapply(discrep, length)
num.discrep
# id name dob vaccinedate vaccinename dose
# 0 1 0 1 0 1
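One caveat worth checking against your own data: setdiff works on the set of values in each column, so it does not line records up by id. The small example below uses made-up data frames x and y (hypothetical, not from the question) in which two records have their doses swapped, and contrasts the column-wise count with a row-wise comparison after aligning by id.

```r
# Hypothetical data: the same two ids, with the dose values swapped.
x <- data.frame(id = 1:2, dose = c(3L, 1L))
y <- data.frame(id = 1:2, dose = c(1L, 3L))

# Column-wise set difference: every value in x$dose also occurs
# somewhere in y$dose, so nothing is reported.
by_column <- sapply(mapply(setdiff, x, y, SIMPLIFY = FALSE), length)

# Row-wise comparison after aligning y to x by id.
y_aligned <- y[match(x$id, y$id), ]
by_record <- colSums(x != y_aligned)

by_column  # id 0, dose 0
by_record  # id 0, dose 2
```

If the same values can legitimately appear in different records, as doses and vaccine names usually do, the row-wise count is the one that reflects per-record errors.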
Per your question on obtaining ids in set two that are not in set one, you could reverse the process with mapply(setdiff, df2, df1), or, if it's simply an exercise of ids only, you could do setdiff(df2$id, df1$id).
For more on R's functional functions (e.g., mapply, sapply, lapply, etc.) see this post.
Updating with a purrr solution:
library(purrr)  # provides map2(), map_int() and the %>% pipe
map2(df1, df2, setdiff) %>%
  map_int(length)