 

Techniques for finding near duplicate records

I'm attempting to clean up a database that, over the years, had acquired many duplicate records, with slightly different names. For example, in the companies table, there are names like "Some Company Limited" and "SOME COMPANY LTD!".

My plan was to export the offending tables into R, convert names to lower case, replace common synonyms (like "limited" -> "ltd"), strip out non-alphabetic characters and then use agrep to see what looks similar.
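
Roughly, the cleaning step I have in mind looks like this (a sketch; the synonym list here is just "limited" -> "ltd" for illustration):

    # Normalise a vector of company names: lower case, synonym
    # replacement, strip non-alphabetic characters, tidy whitespace.
    clean_names <- function(x) {
      x <- tolower(x)
      x <- gsub("\\blimited\\b", "ltd", x)   # common synonym
      x <- gsub("[^a-z ]", "", x)            # drop punctuation and digits
      gsub(" +", " ", trimws(x))             # collapse whitespace
    }

    clean_names(c("Some Company Limited", "SOME COMPANY LTD!"))
    # [1] "some company ltd" "some company ltd"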

My first problem is that agrep only accepts a single pattern to match, and looping over every company name to match against the others is slow. (Some tables to be cleaned will have tens, possibly hundreds of thousands of names to check.)
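
For concreteness, this is the slow approach I mean: one agrep() call per name over the cleaned vector, which is quadratic in the number of names (`companies$name` is a stand-in for the exported column):

    # nm is the cleaned vector of names from above.
    nm <- clean_names(companies$name)
    near_matches <- lapply(seq_along(nm), function(i) {
      agrep(nm[i], nm[-i], max.distance = 0.1, value = TRUE)
    })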

I've very briefly looked at the tm package (JSS article), and it seems very powerful but geared towards analysing big chunks of text, rather than just names.

I have a few related questions:

  1. Is the tm package appropriate for this sort of task?

  2. Is there a faster alternative to agrep? (Said function uses the Levenshtein edit distance, which is anecdotally slow.)

  3. Are there other suitable tools in R, apart from agrep and tm?

  4. Should I even be doing this in R, or should this sort of thing be done directly in the database? (It's an Access database, so I'd rather avoid touching it if possible.)

asked Jul 13 '11 by Richie Cotton


1 Answer

If you're just doing small batches that are relatively well-formed, then the compare.linkage() or compare.dedup() functions in the RecordLinkage package should be a great starting point. But if you have big batches, then you might have to do some more tinkering.
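
For example, a minimal compare.dedup() call might look like this (the data frame and column name are made up; argument names follow the RecordLinkage documentation):

    library(RecordLinkage)

    # Toy table of already-cleaned company names.
    companies <- data.frame(name = c("some company ltd",
                                     "some company ltd",
                                     "another firm plc"),
                            stringsAsFactors = FALSE)

    # Compare every pair of rows using Jaro-Winkler string similarity;
    # for big tables you would add blockfld = ... to limit the pairs.
    pairs <- compare.dedup(companies, strcmp = TRUE, strcmpfun = jarowinkler)
    pairs$pairs   # one row per candidate pair, with similarity scores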

I use the jarowinkler(), levenshteinSim(), and soundex() functions in RecordLinkage to write my own function with my own weighting scheme (also, as it stands, you can't use soundex() on big data sets with RecordLinkage).
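
A sketch of that kind of hand-rolled score (the 0.6/0.3/0.1 weights are purely illustrative, not my actual scheme):

    library(RecordLinkage)

    # Weighted combination of several similarity measures.
    name_score <- function(a, b) {
      0.6 * jarowinkler(a, b) +
      0.3 * levenshteinSim(a, b) +
      0.1 * (soundex(a) == soundex(b))
    }

    name_score("some company ltd", "some compny limited")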

If I have two lists of names that I want to match ("record link"), then I typically convert both to lower case and remove all punctuation. To take care of "Limited" versus "LTD", I typically create another vector containing the first word of each name, which allows extra weighting on the first word. If I think that one list may contain acronyms (maybe ATT or IBM), then I'll acronym-ize the other list. For each list I end up with a data frame of the strings I want to compare, which I write out as separate tables in a MySQL database.
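
The extra columns look roughly like this (a sketch; the acronym rule here is just the first letter of each word):

    # First word of each cleaned name, plus a crude acronym.
    first_word <- function(x) sub("\\s.*$", "", x)
    acronymize <- function(x) {
      sapply(strsplit(x, "\\s+"),
             function(w) paste(substr(w, 1, 1), collapse = ""))
    }

    first_word("international business machines")   # "international"
    acronymize("international business machines")   # "ibm"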

So that I don't end up with too many candidates, I LEFT OUTER JOIN these two tables on something that has to match between the two lists (maybe the first three letters of each name, or the first three letters of the name plus the first three letters of the acronym). Then I calculate match scores using the above functions.
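
The same blocking idea, sketched in R rather than SQL (merge() stands in for the join; the column names are made up, and the block key here is just the first three letters of the cleaned name):

    # list1 and list2 are data frames with a clean_name column.
    list1$block <- substr(list1$clean_name, 1, 3)
    list2$block <- substr(list2$clean_name, 1, 3)

    # Inner join on the block key; all.x = TRUE would mimic the
    # LEFT OUTER JOIN and also keep unmatched rows from list1.
    candidates <- merge(list1, list2, by = "block",
                        suffixes = c(".a", ".b"))

    # Score the candidate pairs and sort so likely matches come first.
    candidates$score <- name_score(candidates$clean_name.a,
                                   candidates$clean_name.b)
    candidates <- candidates[order(-candidates$score), ]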

You still have to do a lot of manual inspection, but you can sort on the score to quickly rule out non-matches.

answered by Richard Herron