Find discrepancies between two tables

Tags: r, dataframe

I'm working with R from a SAS/SQL background, and am trying to write code to take two tables, compare them, and provide a list of the discrepancies. This code would be used repeatedly for many different sets of tables, so I need to avoid hardcoding.

I've been working from the question "Identifying specific differences between two data sets in R", but it doesn't get me all the way there.

Example data, using the combination of Last_Name and First_Name (which is unique) as the key --

Dataset One --

Last_Name  First_Name  Street_Address   ZIP     VisitCount
Doe        John        1234 Main St     12345   20
Doe        Jane        4321 Tower St    54321   10
Don        Bob         771  North Ave   23232   5
Smith      Mike        732 South Blvd.  77777   3        

Dataset Two --

Last_Name  First_Name  Street_Address   ZIP     VisitCount
Doe        John        1234 Main St     12345   20
Doe        Jane        4111 Tower St    32132   17
Donn       Bob         771  North Ave   11111   5

Desired Output --

LastName FirstName VarName         TableOne        TableTwo
Doe      Jane      StreetAddress   4321 Tower St   4111 Tower St
Doe      Jane      Zip             54321           32132
Doe      Jane      VisitCount      10              17

Note that this output ignores records where I don't have the same ID in both tables (for instance, because Bob's last name is "Don" in one table, and "Donn" in another table, we ignore that record entirely).
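
To make the matching rule concrete, here is a minimal base R sketch of which keys would survive. The object names `one`, `two`, and `shared` are made up for illustration, and only the key columns from the example tables are included.

```r
# Key columns only, copied from the example tables above.
one <- data.frame(Last_Name  = c("Doe", "Doe", "Don", "Smith"),
                  First_Name = c("John", "Jane", "Bob", "Mike"),
                  stringsAsFactors = FALSE)
two <- data.frame(Last_Name  = c("Doe", "Doe", "Donn"),
                  First_Name = c("John", "Jane", "Bob"),
                  stringsAsFactors = FALSE)

# merge() with the default all = FALSE is an inner join, so only keys that
# appear in both tables are kept. Bob and Mike Smith drop out.
shared <- merge(one, two, by = c("Last_Name", "First_Name"))
print(shared)
```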

I've explored doing this by applying the melt function to both datasets and then comparing them, but the size of the data I'm working with suggests that wouldn't be practical. In SAS, I used PROC COMPARE for this kind of work, but I haven't found an exact equivalent in R.

asked Jan 20 '15 23:01

Netbrian

1 Answer

Here is a solution based on data.table:

library(data.table)

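# Convert to data.table and reshape to long format: one row per (key, variable).
# unlist() coerces the mixed column types to character, so values are compared as strings.
setDT(d1)
d1 <- d1[, list(VarName = names(.SD), TableOne = unlist(.SD, use.names = FALSE)), by = c('Last_Name', 'First_Name')]

setDT(d2)
d2 <- d2[, list(VarName = names(.SD), TableTwo = unlist(.SD, use.names = FALSE)), by = c('Last_Name', 'First_Name')]

# Set keys for merging
setkey(d1, Last_Name, First_Name, VarName)

# Inner join on the key, then keep only the rows where the values differ
d1[d2, nomatch = 0][TableOne != TableTwo]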

#    Last_Name First_Name        VarName      TableOne      TableTwo
# 1:       Doe       Jane Street_Address 4321 Tower St 4111 Tower St
# 2:       Doe       Jane            ZIP         54321         32132
# 3:       Doe       Jane     VisitCount            10            17

where the input data sets are:

# Input Data Sets
d1 <- structure(list(Last_Name = c("Doe", "Doe", "Don", "Smith"), First_Name = c("John", 
"Jane", "Bob", "Mike"), Street_Address = c("1234 Main St", "4321 Tower St", 
"771  North Ave", "732 South Blvd."), ZIP = c(12345L, 54321L, 
23232L, 77777L), VisitCount = c(20L, 10L, 5L, 3L)), .Names = c("Last_Name", 
"First_Name", "Street_Address", "ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -4L))                                                                                                               

d2 <- structure(list(Last_Name = c("Doe", "Doe", "Donn"), First_Name = c("John", 
"Jane", "Bob"), Street_Address = c("1234 Main St", "4111 Tower St", 
"771  North Ave"), ZIP = c(12345L, 32132L, 11111L), VisitCount = c(20L, 
17L, 5L)), .Names = c("Last_Name", "First_Name", "Street_Address", 
"ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -3L))

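Since the question asks for something reusable across many pairs of tables, the same steps could be wrapped in a function. What follows is only a sketch under stated assumptions: `compare_tables`, `key_cols`, and `to_long` are hypothetical names, not part of data.table or of the code above. It assumes the key columns identify rows uniquely, compares only the non-key columns present in both tables, and compares values as character strings. Pairs where either value is NA are not reported, because the final filter drops NA comparisons.

```r
library(data.table)

# Hypothetical helper, not part of the original answer: compare two tables on
# arbitrary key columns and return one row per differing value.
compare_tables <- function(x, y, key_cols) {
  # Unlike setDT(), as.data.table() does not convert the caller's objects in place.
  x <- as.data.table(x)
  y <- as.data.table(y)

  # Compare only the non-key columns that exist in both tables.
  value_cols <- setdiff(intersect(names(x), names(y)), key_cols)

  # Reshape to long format: one row per (key, variable), values as character.
  to_long <- function(dt, value_name) {
    long <- dt[, list(VarName = rep(value_cols, each = .N),
                      Value = unlist(lapply(.SD, as.character), use.names = FALSE)),
               by = key_cols, .SDcols = value_cols]
    setnames(long, "Value", value_name)
    long
  }

  # Inner join on key columns plus variable name (all = FALSE is the default).
  joined <- merge(to_long(x, "TableOne"), to_long(y, "TableTwo"),
                  by = c(key_cols, "VarName"))

  # Keep differing values; comparisons involving NA are dropped by this filter.
  joined[TableOne != TableTwo]
}

# Usage with the example data from the question.
t1 <- data.frame(Last_Name = c("Doe", "Doe", "Don", "Smith"),
                 First_Name = c("John", "Jane", "Bob", "Mike"),
                 Street_Address = c("1234 Main St", "4321 Tower St",
                                    "771  North Ave", "732 South Blvd."),
                 ZIP = c(12345L, 54321L, 23232L, 77777L),
                 VisitCount = c(20L, 10L, 5L, 3L),
                 stringsAsFactors = FALSE)
t2 <- data.frame(Last_Name = c("Doe", "Doe", "Donn"),
                 First_Name = c("John", "Jane", "Bob"),
                 Street_Address = c("1234 Main St", "4111 Tower St", "771  North Ave"),
                 ZIP = c(12345L, 32132L, 11111L),
                 VisitCount = c(20L, 17L, 5L),
                 stringsAsFactors = FALSE)

result <- compare_tables(t1, t2, key_cols = c("Last_Name", "First_Name"))
print(result)
```

Passing the key as a character vector is what removes the hardcoding: a different pair of tables only needs a different `key_cols` argument.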

answered Oct 20 '22 13:10

Marat Talipov
