Say I have two columns of names. All names in the first column are in the second column, but in a random order, AND some of them are not perfect matches. So maybe in one column theres the name John Smith and in the second John_smith or JonSmith. Is there any fairly simple R way of performing a "best match"?
Given some data like this:
df<-data.frame(x=c('john doe','john smith','sally struthers'),y=c('John Smith','John_smith','JonSmith'))
You can get a long way with a few gsub
s and tolower
:
df$y.fix <- gsub('[[:punct:]]', ' ', df$y)
df$y.fix <- gsub(' ', '', df$y.fix)
df$y.fix <- tolower(df$y.fix)
df$x.fix <- tolower(gsub(' ', '', df$x))
Then agrep
is what you'll want:
> agrep(df$x.fix[2], df$y.fix)
[1] 1 2 3
for more complex confusing strings, see this post from last week.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With