Imperfect String Matching

Question

Say I have two columns of names. All names in the first column are in the second column, but in a random order, and some of them are not perfect matches. For example, one column might contain the name John Smith and the other John_smith or JonSmith. Is there a fairly simple way in R to perform a "best match"?

Justin · Accepted Answer

Given some data like this:

df <- data.frame(x = c('john doe', 'john smith', 'sally struthers'),
                 y = c('John Smith', 'John_smith', 'JonSmith'))

You can get a long way with a few gsubs and tolower:

df$y.fix <- gsub('[[:punct:]]', ' ', df$y)
df$y.fix <- gsub(' ', '', df$y.fix)
df$y.fix <- tolower(df$y.fix)
df$x.fix <- tolower(gsub(' ', '', df$x))
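If the same cleanup is needed on several columns, it can be wrapped in a small function. This is only a sketch: normalize_name is a hypothetical helper name, not something from base R, and it assumes that punctuation and whitespace carry no meaning in your names.

```r
# Hypothetical helper: reduce a name to lowercase with no punctuation or spaces.
normalize_name <- function(s) {
  s <- as.character(s)              # guard against factor columns
  s <- gsub('[[:punct:]]', '', s)   # drop punctuation such as "_"
  s <- gsub('[[:space:]]', '', s)   # drop all whitespace
  tolower(s)
}

normalize_name(c('John Smith', 'John_smith', 'JonSmith'))
# [1] "johnsmith" "johnsmith" "jonsmith"
```

With that in place, the two columns could be cleaned with df$x.fix <- normalize_name(df$x) and df$y.fix <- normalize_name(df$y).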

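Then agrep is what you'll want. It does approximate matching and returns the indices of every element that is close enough to the pattern: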

> agrep(df$x.fix[2], df$y.fix)
[1] 1 2 3
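Note that agrep returns every candidate within its tolerance, not a single best one. If you want exactly one "best match" per name, one possible approach (a sketch, not part of the original answer) is to compute edit distances with adist from the utils package and take the closest candidate. The vectors x and y below are assumed to hold the already-cleaned names from above.

```r
# Cleaned names, as produced by the gsub/tolower steps above.
x <- c('johndoe', 'johnsmith', 'sallystruthers')
y <- c('johnsmith', 'johnsmith', 'jonsmith')

# Generalized Levenshtein distances: one row per x, one column per y.
d <- adist(x, y)

# For each x, the index of the closest y (ties go to the first candidate).
best <- apply(d, 1, which.min)

# Pair each x with its closest y and the distance between them.
matches <- data.frame(x = x,
                      best.y = y[best],
                      distance = d[cbind(seq_along(x), best)])
matches
```

Every row gets a "best" candidate even when nothing is really similar, so it is worth inspecting the distance column and discarding matches above a cutoff that suits your data.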

For more complex or confusing strings, see this post from last week.
