Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find discrepancies between two tables

Tags:

dataframe

r

I'm working with R from a SAS/SQL background, and am trying to write code to take two tables, compare them, and provide a list of the discrepancies. This code would be used repeatedly for many different sets of tables, so I need to avoid hardcoding.

I'm working with Identifying specific differences between two data sets in R , but it doesn't get me all the way there.

Example Data, using the combination of LastName/FirstName (which is unique) as a key --

Dataset One --

Last_Name  First_Name  Street_Address   ZIP     VisitCount
Doe        John        1234 Main St     12345   20
Doe        Jane        4321 Tower St    54321   10
Don        Bob         771  North Ave   23232   5
Smith      Mike        732 South Blvd.  77777   3        

Dataset Two --

Last_Name  First_Name  Street_Address   ZIP     VisitCount
Doe        John        1234 Main St     12345   20
Doe        Jane        4111 Tower St    32132   17
Donn       Bob         771  North Ave   11111   5

   Desired Output --

   LastName FirstName VarName         TableOne        TableTwo
   Doe      Jane      StreetAddress   4321 Tower St   4111 Tower St 
   Doe      Jane      Zip             23232           32132
   Doe      Jane      VisitCount      5               17

Note that this output ignores records where I don't have the same ID in both tables (for instance, because Bob's last name is "Don" in one table, and "Donn" in another table, we ignore that record entirely).

I've explored doing this by applying the melt function on both datasets, and then comparing them, but the size data I'm working with indicates that wouldn't be practical. In SAS, I used Proc Compare for this kind of work, but I haven't found an exact equivalent in R.

like image 996
Netbrian Avatar asked Jan 20 '15 23:01

Netbrian


People also ask

How do I compare two tables in SQL to find unmatched records?

Use the Find Unmatched Query Wizard to compare two tables One the Create tab, in the Queries group, click Query Wizard. In the New Query dialog box, double-click Find Unmatched Query Wizard. On the first page of the wizard, select the table that has unmatched records, and then click Next.

How can I find the difference between two tables in SQL Server?

Comparing the Results of the Two Queries Let us suppose, we have two tables: table1 and table2. Here, we will use UNION ALL to combine the records based on columns that need to compare. If the values in the columns that need to compare are the same, the COUNT(*) returns 2, otherwise the COUNT(*) returns 1.


1 Answers

Here is a solution based on data.table:

library(data.table)

# Convert into data.table, melt
setDT(d1)
d1 <- d1[, list(VarName = names(.SD), TableOne = unlist(.SD, use.names = F)),by=c('Last_Name','First_Name')]

setDT(d2)
d2 <- d2[, list(VarName = names(.SD), TableTwo = unlist(.SD, use.names = F)),by=c('Last_Name','First_Name')]

# Set keys for merging
setkey(d1,Last_Name,First_Name,VarName)

# Merge, remove duplicates
d1[d2,nomatch=0][TableOne!=TableTwo]

#     Last_Name First_Name        VarName      TableOne      TableTwo
#     1:       Doe       Jane Street_Address 4321 Tower St 4111 Tower St
#     2:       Doe       Jane            ZIP         54321         32132
#     3:       Doe       Jane     VisitCount            10            17

where input data sets are:

# Input Data Sets
d1 <- structure(list(Last_Name = c("Doe", "Doe", "Don", "Smith"), First_Name = c("John", 
"Jane", "Bob", "Mike"), Street_Address = c("1234 Main St", "4321 Tower St", 
"771  North Ave", "732 South Blvd."), ZIP = c(12345L, 54321L, 
23232L, 77777L), VisitCount = c(20L, 10L, 5L, 3L)), .Names = c("Last_Name", 
"First_Name", "Street_Address", "ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -4L))                                                                                                               

d2 <- structure(list(Last_Name = c("Doe", "Doe", "Donn"), First_Name = c("John", 
"Jane", "Bob"), Street_Address = c("1234 Main St", "4111 Tower St", 
"771  North Ave"), ZIP = c(12345L, 32132L, 11111L), VisitCount = c(20L, 
17L, 5L)), .Names = c("Last_Name", "First_Name", "Street_Address", 
"ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -3L))
like image 104
Marat Talipov Avatar answered Oct 20 '22 13:10

Marat Talipov