Idea is to compare two string vectors such as:
df <- data.frame(a = c("New York 001", "Orlando 002", "Boston 003", "Chicago 004", "Atlanta 005"),
b = c("NEW YORK 001", "Orlando", "Boston (003)", "Chicago 005", "005 Atlanta"))
And come up with a way to give them some measure of precision. Basically adding column c that has a numeric value.
My train of thought:
We have this:
> df
a b
1 New York 001 NEW YORK 001
2 Orlando 002 Orlando
3 Boston 003 Boston (003)
4 Chicago 004 Chicago 005
5 Atlanta 005 005 Atlanta
First thing's first - strip whites, ignore cases and remove all special characters while we're at it.
df$a <- gsub("[[:space:]]|[[:punct:]]", "", toupper(df$a))
df$b <- gsub("[[:space:]]|[[:punct:]]", "", toupper(df$b))
What we get:
> df
a b
1 NEWYORK001 NEWYORK001
2 ORLANDO002 ORLANDO
3 BOSTON003 BOSTON003
4 CHICAGO004 CHICAGO005
5 ATLANTA005 005ATLANTA
So now we're at the core of the problem.
First line would be 100% match. Second line has 7 matching characters out of maximum 10 in col a. Hence 70%. Third now matches at 100%. Fourth has 90% match. Fifth one is tricky. Human mind tells me they match, but there's problem with the order. But that's not how the computer works. Realistically it can be measured as 70% match because 7 consecutive characters repeat in both strings.
So the question is:
How to make this quantitative measure of string comparison?
Perhaps there is a better way to do this, since I've never had an experience with comparing string sets on partial match. And comming up with this particular quantifiable measure is just my intuitive way of doing things. I wouldn't be surprised if R already had a library/function that does all this in a better way that I'm simply not aware of.
I've arrived to fairly easy answer to my own question. And it is Levenshtein distance. Or adist()
in R.
Long story short:
df$c <- 1 - diag(adist(df$a, df$b, fixed = F)) / apply(cbind(nchar(df$a), nchar(df$b)), 1, max)
This does the trick.
> df
a b c
1 NEWYORK001 NEWYORK001 1.0
2 ORLANDO002 ORLANDO 0.7
3 BOSTON003 BOSTON003 1.0
4 CHICAGO004 CHICAGO005 0.9
5 ATLANTA005 005ATLANTA 0.7
Update:
Running the function on one of my data sets returns cute result (that made my inner nerd chuckle a bit):
Error: cannot allocate vector of size 1650.7 Gb
So, I guess it's another apply()
loop for adist()
, taking diagonal of the whole matrix is... well, fairly inefficient.
df$c <- 1 - apply(cbind(df$a, df$b),1, function(x) adist(x[1], x[2], fixed = F)) / apply(cbind(nchar(df$a), nchar(df$b)), 1, max)
This modification yields very satisfying results.
A more correct answer with Rcpp:
library(Rcpp)
cppFunction('NumericVector commonChars(CharacterVector x, CharacterVector y) {
int len = x.size();
NumericVector out(len);
double percentage;
int count=0,k=0;
std::string compared;
std::string source;
for (int i=0; i<len;++i) {
source = x[i];
compared = y[i];
count=0;
k=0;
for (int j=0;j<compared.length();j++) {
if (source[j] == compared[j]) { count++; continue; }
while(k < source.length()) {
if (source[j] == compared[k]) { count++; break; }
k++;
}
}
percentage = (count+0.0)/(source.length()+0.0);
out[i] = percentage;
}
return out;
}')
Giving:
> commonChars(df$a,df$b)
[1] 1.0 0.7 1.0 0.9 0.7
I didn't bench it against other answers nor with large dataframe.
Not really what you're wishing but here's an idea (I'll try to improve it):
df$r <- gsub("\\w","(\1)?",df$a)
for (i in 1:length(df$a)) {
df$percentage[i] < ( as.integer(
attr(
regexpr( df$r[i], df$b[i]),
"match.length"
)
) / str_length(df$a[i]) * 100)
}
Output:
a b r percentage
1 NEWYORK001 NEWYORK001 (N)?(E)?(W)?(Y)?(O)?(R)?(K)?(0)?(0)?(1)? 100
2 ORLANDO002 ORLANDO (O)?(R)?(L)?(A)?(N)?(D)?(O)?(0)?(0)?(2)? 70
3 BOSTON003 BOSTON003 (B)?(O)?(S)?(T)?(O)?(N)?(0)?(0)?(3)? 100
4 CHICAGO004 CHICAGO005 (C)?(H)?(I)?(C)?(A)?(G)?(O)?(0)?(0)?(4)? 90
5 ATLANTA005 005ATLANTA (A)?(T)?(L)?(A)?(N)?(T)?(A)?(0)?(0)?(5)? 30
Drawbacks:
ATLANTA005
return 30% because of the 005 matching only in the order.I'll see if I can find a way to build a better regexp
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With