
Merging two Data Frames using Fuzzy/Approximate String Matching in R


DESCRIPTION

I have two datasets with information that I need to merge. The only fields they have in common are strings that do not match perfectly and a numerical field that can differ substantially.

The only way to explain the problem is to show you the data. Here is a.csv and b.csv. I am trying to merge B to A.

File B has three fields and file A has four: Company Name (file A only), Fund Name, Asset Class, and Assets. So far, I have focused on matching the Fund Names by replacing words or parts of the strings to create exact matches, and then using:

a <- read.table(file = "http://bertelsen.ca/R/a.csv", header=TRUE, sep=",", na.strings=F, strip.white=T, blank.lines.skip=F, stringsAsFactors=T)
b <- read.table(file = "http://bertelsen.ca/R/b.csv", header=TRUE, sep=",", na.strings=F, strip.white=T, blank.lines.skip=F, stringsAsFactors=T)
merge(a, b, by="Fund.Name")

However, this only brings me to about 30% matching. The rest I have to do by hand.

Assets is a numerical field that is not always correct in either file and can vary wildly if the fund has low assets. Asset Class is a string field that is "generally" the same in both files; however, there are discrepancies.

Adding to the complication are the different series of funds in file B. For example:

AGF Canadian Value

AGF Canadian Value-D

In these cases, I have to choose as the match either the fund with no series suffix or the one whose series is called "A", "-A", or "Advisor".
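The series rule above could be sketched in base R roughly as follows. The helper names `strip_series` and `series_rank` are hypothetical, the suffix patterns are guesses based only on the examples in the question, and the preference order (unseried first, then "A"/"Advisor", then anything else) is an assumption.

```r
# Hypothetical helper: drop a trailing series marker ("-D", "-A", " A",
# " Advisor") so that every series of a fund shares one base name.
strip_series <- function(x) {
  x <- trimws(as.character(x))
  sub("(\\s*-\\s*[A-Z]|\\s+A|\\s+Advisor)$", "", x)
}

# Hypothetical helper: lower rank = preferred candidate (assumed ordering).
series_rank <- function(x) {
  x <- trimws(as.character(x))
  rank <- rep(1L, length(x))                          # no series marker
  rank[grepl("-\\s*[A-Z]$", x)] <- 3L                 # some other series, e.g. "-D"
  rank[grepl("(-\\s*A|\\s+A|\\s+Advisor)$", x)] <- 2L # "-A", "A" or "Advisor"
  rank
}

# Toy data, not the real b.csv.
b_toy <- data.frame(
  Fund.Name = c("AGF Canadian Value-D", "AGF Canadian Value", "AGF Canadian Value-A"),
  stringsAsFactors = FALSE
)
b_toy$base <- strip_series(b_toy$Fund.Name)
b_toy$rank <- series_rank(b_toy$Fund.Name)

# Keep one row per base name: the lowest rank wins.
b_toy  <- b_toy[order(b_toy$base, b_toy$rank), ]
b_pick <- b_toy[!duplicated(b_toy$base), ]
b_pick$Fund.Name
```

Real fund names with a hyphen near the end would need a stricter pattern, so the output is worth checking by eye before merging.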

QUESTION

What would you say is the best approach? This exercise is something that I have to do on a monthly basis, and matching them manually is incredibly time consuming. Examples of code would be instrumental.

IDEAS

One method that I think may work is normalizing the strings based on the first capitalized letter of each word in the string. But I haven't been able to figure out how to pull that off using R.
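One possible reading of this idea is to build a short key from the capital letter that starts each word. A minimal base R sketch, with `initials_key` as a hypothetical helper name and toy fund names:

```r
# Hypothetical helper: collect the uppercase letters that begin a word.
initials_key <- function(x) {
  x <- as.character(x)
  # \\b is a word boundary, so only the first letter of each word is kept.
  hits <- regmatches(x, gregexpr("\\b[A-Z]", x, perl = TRUE))
  vapply(hits, paste, character(1), collapse = "")
}

keys <- initials_key(c("AGF Canadian Value", "AGF Cdn Value"))
keys
# [1] "ACV" "ACV"
```

Unrelated funds can share the same initials, so a key like this is probably safer as a first filter for candidate pairs than as the final match.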

Another method I considered was creating an index of matches based on a combination of assets, fund name, asset class and company. But again, I'm not sure how to do this with R. Or, for that matter, if it's even possible.
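This kind of combined index is possible in base R. The sketch below scores every pair of rows on name, asset class and assets, then keeps the best candidate per row of A. The toy data, the column names and the 0.6/0.3/0.1 weights are all assumptions made for illustration, not tuned values.

```r
a_toy <- data.frame(
  Fund.Name   = c("AGF Canadian Value", "RBC Bond Fund"),
  Asset.Class = c("Canadian Equity", "Fixed Income"),
  Assets      = c(120.5, 900),
  stringsAsFactors = FALSE
)
b_toy <- data.frame(
  Fund.Name   = c("RBC Bond Fd", "AGF Cdn Value", "AGF Canadian Bond"),
  Asset.Class = c("Fixed Income", "Canadian Equity", "Fixed Income"),
  Assets      = c(880, 131, 45),
  stringsAsFactors = FALSE
)

# Name similarity in [0, 1]: 1 minus edit distance scaled by the longer name.
name_dist <- adist(a_toy$Fund.Name, b_toy$Fund.Name, ignore.case = TRUE)
longest   <- outer(nchar(a_toy$Fund.Name), nchar(b_toy$Fund.Name), pmax)
name_sim  <- 1 - name_dist / longest

# Asset class agreement: 1 if the strings are identical, 0 otherwise.
class_sim <- outer(a_toy$Asset.Class, b_toy$Asset.Class, "==") * 1

# Asset similarity in [0, 1]: 1 minus the relative difference.
asset_sim <- 1 - outer(a_toy$Assets, b_toy$Assets,
                       function(x, y) abs(x - y) / pmax(x, y))

# Assumed weights: the name dominates, assets count least because they are noisy.
score <- 0.6 * name_sim + 0.3 * class_sim + 0.1 * asset_sim

# Best-scoring row of b_toy for each row of a_toy.
best <- apply(score, 1, which.max)
data.frame(a = a_toy$Fund.Name,
           b = b_toy$Fund.Name[best],
           score = score[cbind(seq_along(best), best)])
```

In this toy data the two AGF candidates are equally far from "AGF Canadian Value" by edit distance, so the asset class and assets terms decide between them. Pairs with a low best score can be set aside for manual review.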

Examples of code, comments, thoughts and direction are greatly appreciated!

asked Feb 09 '10 19:02 by Brandon Bertelsen



2 Answers

I highly recommend the dgrtwo/fuzzyjoin package:

library(fuzzyjoin)
stringdist_inner_join(a, b, by="Fund.Name")
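A slightly fuller sketch of that call on toy data, assuming the fuzzyjoin package is installed. The column names differ between the two toy frames on purpose, and the distance method and threshold are illustrative choices, not recommendations for the real files.

```r
library(fuzzyjoin)

# Toy data, not the real a.csv and b.csv.
a_toy <- data.frame(Fund.Name = c("AGF Canadian Value", "RBC Bond Fund"),
                    stringsAsFactors = FALSE)
b_toy <- data.frame(Fund.name = c("AGF Cdn Value", "RBC Bond Fd", "TD Equity"),
                    stringsAsFactors = FALSE)

matched <- stringdist_inner_join(
  a_toy, b_toy,
  by = c("Fund.Name" = "Fund.name"),  # column in a_toy = column in b_toy
  method = "osa",                     # optimal string alignment distance
  max_dist = 5,                       # largest distance still treated as a match
  distance_col = "dist"               # keep the distance for later review
)
matched
```

A larger `max_dist` finds more pairs but also more false matches, so sorting the result by `dist` and reviewing the high end is a reasonable check.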

answered Oct 12 '22 14:10 by crestor


One quick suggestion: try to do some matching on the different fields separately before using merge. The simplest approach is with the pmatch function, although R has no shortage of text matching functions (e.g. agrep). Here's a simple example:

pmatch(c("med", "mod"), c("mean", "median", "mode")) 
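To illustrate the difference between the two functions mentioned above: pmatch only completes prefixes, while agrep tolerates a small number of edits anywhere in the string. The fund names below are made up, and the `max.distance` value is an assumption chosen for the example.

```r
candidates <- c("AGF Canadian Value", "AGF Canadian Value-D", "RBC Bond Fund")

# The misspelled name is not a prefix of any candidate, so pmatch gives NA.
pm <- pmatch("AGF Candian Value", candidates)

# agrep allows up to one edit here and returns every candidate that
# approximately contains the pattern.
ag <- agrep("AGF Candian Value", candidates, max.distance = 1, value = TRUE)

pm
ag
```

Because agrep matches approximate substrings, it returns both series of the AGF fund here, so a second step is still needed to pick one.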

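For your dataset, this matches all the fund names out of a:

> nrow(merge(a,b,x.by="Fund.Name", y.by="Fund.name"))
[1] 58
> length(which(!is.na(pmatch(a$Fund.Name, b$Fund.name))))
[1] 238

Note that merge() names these arguments by.x and by.y; x.by and y.by are not recognized, so the first call falls back to joining on the columns that share a name in both data frames.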

Once you create matches, you can easily merge them together using those instead.
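One way this last step might look, sketched on toy data with assumed column names: use the index returned by pmatch to pull the matching rows of b next to a.

```r
# Toy data, not the real a.csv and b.csv.
a_toy <- data.frame(Fund.Name = c("AGF Canadian Value", "RBC Bond"),
                    Company   = c("AGF", "RBC"),
                    stringsAsFactors = FALSE)
b_toy <- data.frame(Fund.name = c("RBC Bond Fund", "AGF Canadian Value-D", "TD Equity"),
                    Assets    = c(900, 120, 45),
                    stringsAsFactors = FALSE)

# Row of b_toy that each a_toy name partially matches (NA if none or ambiguous).
idx <- pmatch(a_toy$Fund.Name, b_toy$Fund.name)

# Bind the matched b rows next to a; unmatched rows get NA in the b columns.
merged <- cbind(a_toy, b_toy[idx, , drop = FALSE])
rownames(merged) <- NULL
merged
```

Rows of a with `NA` in `idx` are the ones left over for a fuzzier method or for manual matching.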

answered Oct 12 '22 13:10 by Shane