 

Imperfect String Matching

Tags:

r

Say I have two columns of names. All names in the first column are in the second column, but in a random order, and some of them are not perfect matches. For example, one column might contain the name John Smith while the second has John_smith or JonSmith. Is there a fairly simple way in R to perform a "best match"?

JoshDG asked Feb 08 '12 15:02


1 Answer

Given some data like this:

df <- data.frame(
  x = c('john doe', 'john smith', 'sally struthers'),
  y = c('John Smith', 'John_smith', 'JonSmith')
)

You can get a long way with a few gsubs and tolower:

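# Replace punctuation with spaces, then strip all spaces and lower-case
df$y.fix <- gsub('[[:punct:]]', ' ', df$y)
df$y.fix <- gsub(' ', '', df$y.fix)
df$y.fix <- tolower(df$y.fix)
# x only needs its spaces removed and lower-casing
df$x.fix <- tolower(gsub(' ', '', df$x))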
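
If the same cleanup is needed on several columns, the steps above could be wrapped in one small function. This is only a sketch: `normalize_name` is a hypothetical helper name, not something from base R, and it drops punctuation and whitespace in a single pass rather than two.

```r
# Hypothetical helper (not part of base R): lower-case a name and
# remove every punctuation and whitespace character.
normalize_name <- function(s) {
  s <- tolower(as.character(s))
  gsub("[[:punct:][:space:]]", "", s)
}

normalize_name(c("John Smith", "John_smith", "JonSmith"))
# [1] "johnsmith" "johnsmith" "jonsmith"
```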

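Then agrep, which does approximate (fuzzy) matching, is what you'll want. It returns the indices of every element that matches within its distance tolerance: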

> agrep(df$x.fix[2], df$y.fix)
[1] 1 2 3
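
Since agrep returns every candidate within tolerance rather than a single winner, one possible way to pick a "best match" is to compute edit distances with adist (in the utils package, attached by default) and take the closest entry per name. The sketch below uses made-up data, and `clean` is a hypothetical helper; ties are broken by taking the first minimum, which may not be what you want.

```r
# Illustrative data (an assumption, not the data from the question)
x <- c("john doe", "john smith", "sally struthers")
y <- c("John Doe", "Sally_Struthers", "JonSmith")

# Hypothetical helper: lower-case and strip punctuation and whitespace
clean <- function(s) gsub("[[:punct:][:space:]]", "", tolower(s))

# Edit-distance matrix: one row per element of x, one column per element of y
d <- adist(clean(x), clean(y))

# For each x, the index of the closest y (first minimum wins on ties)
best <- apply(d, 1, which.min)

data.frame(
  x        = x,
  match    = y[best],
  distance = d[cbind(seq_along(x), best)]
)
```

Here "john smith" pairs with "JonSmith" at distance 1, and the other two names match at distance 0 after cleaning. In practice you would probably also want to reject matches whose distance exceeds some threshold.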

For more complex, confusing strings, see this post from last week.

Justin answered Oct 01 '22 14:10