Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Measuring the identicality of strings (in Javascript)

In principle this question can be answered language-independent, but specifically I am looking for a Javascript implementation.

Are there any libraries that allow me to measure the "identicality" of two strings? More generally, are there any algorithms that do this, that I could implement (in Javascript)?

Take, as an example, the following string

Abnormal Elasticity of Single-Crystal Magnesiosiderite across the Spin Transition in Earth’s Lower Mantle

And also consider the following, slightly adjusted string. Note the boldface parts that are different

bnormal Elasticity of Single Crystal Magnesio-Siderite across the Spin-Transition in Earths Lower Mantle.

Javascript's native equality operators won't tell you a lot about the relation between these strings. In this particular case, you could match the strings using regex, but in general that only works when you know which differences to expect. If the input strings are random, the generality of this approach breaks down quickly.

Approach... I can imagine writing an algorithm that splits up the input string in an arbitrary amount N of substrings, and then matching the target string with all those substrings, and using the amount of matches as a measurement of identicality. But this feels like an unattractive approach, and I wouldn't even want to think about how big O will depend on N.

It would seem to me that there are a lot of free parameters in such an algorithm. For example, whether case-sensitivity of characters should contribute equally/more/less to the measurement than order-preservation of characters, seems like an arbitrary choice to make by the designer, i.e.:

identicality("Abxy", "bAxy") versus identicality("Abxy", "aBxy")

Defining the requirements more specifically... The first example is the scenario in which I could use it. I'm loading a bunch of strings (titles of academic papers), and I check whether I have them in my database. However, the source might contain typos, differences in conventions, errors, whatever, which makes matching hard. There probably is a more easy way to match titles in this specific scenario: as you can sort of expect what might go wrong, this allows you to write down some regex beast.

like image 807
Maurits Moeys Avatar asked Jan 26 '17 12:01

Maurits Moeys


1 Answers

You can implement Hirschberg's algorithm and distinguish delete/insert operations
(or alter Levenshtein).

For Hirschbers("Abxy", "bAxy") the results are:

It was 2 edit operations:
keep: 3
insert: 1
delete: 1

Abxy to bAxy

and for Hirschbers("Abxy", "aBxy") the results are:

It was 2 edit operations:
keep: 2
replace: 2

Abxy to aBxy

You can check the javascript implementation on this page.

'Optimal' String-Alignment Distance

function optimalStringAlignmentDistance(s, t) {
  // Determine the "optimal" string-alignment distance between s and t
  if (!s || !t) {
    return 99;
  }
  var m = s.length;
  var n = t.length;
  
  /* For all i and j, d[i][j] holds the string-alignment distance
   * between the first i characters of s and the first j characters of t.
   * Note that the array has (m+1)x(n+1) values.
   */
  var d = new Array();
  for (var i = 0; i <= m; i++) {
    d[i] = new Array();
    d[i][0] = i;
  }
  for (var j = 0; j <= n; j++) {
    d[0][j] = j;
  }
        
  // Determine substring distances
  var cost = 0;
  for (var j = 1; j <= n; j++) {
    for (var i = 1; i <= m; i++) {
      cost = (s.charAt(i-1) == t.charAt(j-1)) ? 0 : 1;   // Subtract one to start at strings' index zero instead of index one
      d[i][j] = Math.min(d[i][j-1] + 1,                  // insertion
                         Math.min(d[i-1][j] + 1,         // deletion
                                  d[i-1][j-1] + cost));  // substitution
                        
      if(i > 1 && j > 1 && s.charAt(i-1) == t.charAt(j-2) && s.charAt(i-2) == t.charAt(j-1)) {
        d[i][j] = Math.min(d[i][j], d[i-2][j-2] + cost); // transposition
      }
    }
  }
  
  // Return the strings' distance
  return d[m][n];
}

alert(optimalStringAlignmentDistance("Abxy", "bAxy"))
alert(optimalStringAlignmentDistance("Abxy", "aBxy"))

Damerau-Levenshtein Distance

function damerauLevenshteinDistance(s, t) {
  // Determine the Damerau-Levenshtein distance between s and t
  if (!s || !t) {
    return 99;
  }
  var m = s.length;
  var n = t.length;      
  var charDictionary = new Object();
  
  /* For all i and j, d[i][j] holds the Damerau-Levenshtein distance
   * between the first i characters of s and the first j characters of t.
   * Note that the array has (m+1)x(n+1) values.
   */
  var d = new Array();
  for (var i = 0; i <= m; i++) {
    d[i] = new Array();
    d[i][0] = i;
  }
  for (var j = 0; j <= n; j++) {
    d[0][j] = j;
  }
  
  // Populate a dictionary with the alphabet of the two strings
  for (var i = 0; i < m; i++) {
    charDictionary[s.charAt(i)] = 0;
  }
  for (var j = 0; j < n; j++) {
    charDictionary[t.charAt(j)] = 0;
  }
  
  // Determine substring distances
  for (var i = 1; i <= m; i++) {
    var db = 0;
    for (var j = 1; j <= n; j++) {
      var i1 = charDictionary[t.charAt(j-1)];
      var j1 = db;
      var cost = 0;
      
      if (s.charAt(i-1) == t.charAt(j-1)) { // Subtract one to start at strings' index zero instead of index one
        db = j;
      } else {
        cost = 1;
      }
      d[i][j] = Math.min(d[i][j-1] + 1,                 // insertion
                         Math.min(d[i-1][j] + 1,        // deletion
                                  d[i-1][j-1] + cost)); // substitution
      if(i1 > 0 && j1 > 0) {
        d[i][j] = Math.min(d[i][j], d[i1-1][j1-1] + (i-i1-1) + (j-j1-1) + 1); //transposition
      }
    }
    charDictionary[s.charAt(i-1)] = i;
  }
        
  // Return the strings' distance
  return d[m][n];
}

alert(damerauLevenshteinDistance("Abxy", "aBxy"))
alert(damerauLevenshteinDistance("Abxy", "bAxy"))

Optimal String Alignment has better performance

Optimal String Alignment Distance 0.20-0.30ms
Damerau-Levenshtein Distance 0.40-0.50ms

like image 150
Panos Kal. Avatar answered Oct 19 '22 23:10

Panos Kal.