Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Comparing string vectors and quantifying differences

Tags:

string

regex

r

Idea is to compare two string vectors such as:

df <- data.frame(a = c("New York 001", "Orlando 002", "Boston 003", "Chicago 004", "Atlanta 005"),
                 b = c("NEW YORK  001", "Orlando", "Boston (003)", "Chicago 005", "005 Atlanta"))

And come up with a way to give them some measure of precision. Basically adding column c that has a numeric value.

My train of thought:

We have this:

> df
             a             b
1 New York 001 NEW YORK  001
2  Orlando 002       Orlando
3   Boston 003  Boston (003)
4  Chicago 004   Chicago 005
5  Atlanta 005   005 Atlanta

First thing's first - strip whites, ignore cases and remove all special characters while we're at it.

df$a <- gsub("[[:space:]]|[[:punct:]]", "", toupper(df$a))
df$b <- gsub("[[:space:]]|[[:punct:]]", "", toupper(df$b))

What we get:

> df
           a          b
1 NEWYORK001 NEWYORK001
2 ORLANDO002    ORLANDO
3  BOSTON003  BOSTON003
4 CHICAGO004 CHICAGO005
5 ATLANTA005 005ATLANTA

So now we're at the core of the problem.

First line would be 100% match. Second line has 7 matching characters out of maximum 10 in col a. Hence 70%. Third now matches at 100%. Fourth has 90% match. Fifth one is tricky. Human mind tells me they match, but there's problem with the order. But that's not how the computer works. Realistically it can be measured as 70% match because 7 consecutive characters repeat in both strings.

So the question is:

How to make this quantitative measure of string comparison?

Perhaps there is a better way to do this, since I've never had an experience with comparing string sets on partial match. And comming up with this particular quantifiable measure is just my intuitive way of doing things. I wouldn't be surprised if R already had a library/function that does all this in a better way that I'm simply not aware of.

like image 770
statespace Avatar asked Dec 01 '22 14:12

statespace


2 Answers

I've arrived to fairly easy answer to my own question. And it is Levenshtein distance. Or adist() in R.

Long story short:

df$c <- 1 - diag(adist(df$a, df$b, fixed = F)) / apply(cbind(nchar(df$a), nchar(df$b)), 1, max)

This does the trick.

> df
           a          b   c
1 NEWYORK001 NEWYORK001 1.0
2 ORLANDO002    ORLANDO 0.7
3  BOSTON003  BOSTON003 1.0
4 CHICAGO004 CHICAGO005 0.9
5 ATLANTA005 005ATLANTA 0.7

Update:

Running the function on one of my data sets returns cute result (that made my inner nerd chuckle a bit):

Error: cannot allocate vector of size 1650.7 Gb

So, I guess it's another apply() loop for adist(), taking diagonal of the whole matrix is... well, fairly inefficient.

df$c <- 1 - apply(cbind(df$a, df$b),1, function(x) adist(x[1], x[2], fixed = F)) / apply(cbind(nchar(df$a), nchar(df$b)), 1, max)

This modification yields very satisfying results.

like image 157
statespace Avatar answered Dec 15 '22 09:12

statespace


A more correct answer with Rcpp:

library(Rcpp)

cppFunction('NumericVector commonChars(CharacterVector x, CharacterVector y) {
  int len = x.size();
  NumericVector out(len);
  double percentage;

  int count=0,k=0;
  std::string compared;
  std::string source;

  for (int i=0; i<len;++i) {
    source = x[i];
    compared = y[i];
    count=0;
    k=0;

    for (int j=0;j<compared.length();j++) {
      if (source[j] == compared[j]) { count++; continue; }

      while(k < source.length()) {
        if (source[j] == compared[k]) { count++; break; }
        k++;
      }
    }
    percentage = (count+0.0)/(source.length()+0.0);
    out[i] = percentage;
  }
  return out;
}')

Giving:

> commonChars(df$a,df$b)
[1] 1.0 0.7 1.0 0.9 0.7

I didn't bench it against other answers nor with large dataframe.


Not really what you're wishing but here's an idea (I'll try to improve it):

df$r <- gsub("\\w","(\1)?",df$a)
for (i in 1:length(df$a)) {
   df$percentage[i] < ( as.integer( 
                           attr( 
                             regexpr( df$r[i], df$b[i]), 
                             "match.length" 
                           ) 
                       ) / str_length(df$a[i]) * 100) 
}

Output:

               a          b                                        r percentage
1 NEWYORK001 NEWYORK001 (N)?(E)?(W)?(Y)?(O)?(R)?(K)?(0)?(0)?(1)?        100
2 ORLANDO002    ORLANDO (O)?(R)?(L)?(A)?(N)?(D)?(O)?(0)?(0)?(2)?         70
3  BOSTON003  BOSTON003     (B)?(O)?(S)?(T)?(O)?(N)?(0)?(0)?(3)?        100
4 CHICAGO004 CHICAGO005 (C)?(H)?(I)?(C)?(A)?(G)?(O)?(0)?(0)?(4)?         90
5 ATLANTA005 005ATLANTA (A)?(T)?(L)?(A)?(N)?(T)?(A)?(0)?(0)?(5)?         30

Drawbacks:

  • There's a for loop
  • ATLANTA005 return 30% because of the 005 matching only in the order.

I'll see if I can find a way to build a better regexp

like image 35
Tensibai Avatar answered Dec 15 '22 10:12

Tensibai