In principle this question can be answered language-independent, but specifically I am looking for a Javascript implementation.
Are there any libraries that allow me to measure the "identicality" of two strings? More generally, are there any algorithms that do this, that I could implement (in Javascript)?
Take, as an example, the following string
Abnormal Elasticity of Single-Crystal Magnesiosiderite across the Spin Transition in Earth’s Lower Mantle
And also consider the following, slightly adjusted string. Note the boldface parts that are different
bnormal Elasticity of Single Crystal Magnesio-Siderite across the Spin-Transition in Earths Lower Mantle.
Javascript's native equality operators won't tell you a lot about the relation between these strings. In this particular case, you could match the strings using regex, but in general that only works when you know which differences to expect. If the input strings are random, the generality of this approach breaks down quickly.
Approach... I can imagine writing an algorithm that splits up the input string in an arbitrary amount N
of substrings, and then matching the target string with all those substrings, and using the amount of matches as a measurement of identicality. But this feels like an unattractive approach, and I wouldn't even want to think about how big O will depend on N
.
It would seem to me that there are a lot of free parameters in such an algorithm. For example, whether case-sensitivity of characters should contribute equally/more/less to the measurement than order-preservation of characters, seems like an arbitrary choice to make by the designer, i.e.:
identicality("Abxy", "bAxy")
versusidenticality("Abxy", "aBxy")
Defining the requirements more specifically... The first example is the scenario in which I could use it. I'm loading a bunch of strings (titles of academic papers), and I check whether I have them in my database. However, the source might contain typos, differences in conventions, errors, whatever, which makes matching hard. There probably is a more easy way to match titles in this specific scenario: as you can sort of expect what might go wrong, this allows you to write down some regex beast.
You can implement Hirschberg's algorithm and distinguish delete/insert
operations
(or alter Levenshtein
).
For Hirschbers("Abxy", "bAxy")
the results are:
It was 2 edit operations:
keep: 3
insert: 1
delete: 1
and for Hirschbers("Abxy", "aBxy")
the results are:
It was 2 edit operations:
keep: 2
replace: 2
You can check the javascript implementation on this page.
function optimalStringAlignmentDistance(s, t) {
// Determine the "optimal" string-alignment distance between s and t
if (!s || !t) {
return 99;
}
var m = s.length;
var n = t.length;
/* For all i and j, d[i][j] holds the string-alignment distance
* between the first i characters of s and the first j characters of t.
* Note that the array has (m+1)x(n+1) values.
*/
var d = new Array();
for (var i = 0; i <= m; i++) {
d[i] = new Array();
d[i][0] = i;
}
for (var j = 0; j <= n; j++) {
d[0][j] = j;
}
// Determine substring distances
var cost = 0;
for (var j = 1; j <= n; j++) {
for (var i = 1; i <= m; i++) {
cost = (s.charAt(i-1) == t.charAt(j-1)) ? 0 : 1; // Subtract one to start at strings' index zero instead of index one
d[i][j] = Math.min(d[i][j-1] + 1, // insertion
Math.min(d[i-1][j] + 1, // deletion
d[i-1][j-1] + cost)); // substitution
if(i > 1 && j > 1 && s.charAt(i-1) == t.charAt(j-2) && s.charAt(i-2) == t.charAt(j-1)) {
d[i][j] = Math.min(d[i][j], d[i-2][j-2] + cost); // transposition
}
}
}
// Return the strings' distance
return d[m][n];
}
alert(optimalStringAlignmentDistance("Abxy", "bAxy"))
alert(optimalStringAlignmentDistance("Abxy", "aBxy"))
function damerauLevenshteinDistance(s, t) {
// Determine the Damerau-Levenshtein distance between s and t
if (!s || !t) {
return 99;
}
var m = s.length;
var n = t.length;
var charDictionary = new Object();
/* For all i and j, d[i][j] holds the Damerau-Levenshtein distance
* between the first i characters of s and the first j characters of t.
* Note that the array has (m+1)x(n+1) values.
*/
var d = new Array();
for (var i = 0; i <= m; i++) {
d[i] = new Array();
d[i][0] = i;
}
for (var j = 0; j <= n; j++) {
d[0][j] = j;
}
// Populate a dictionary with the alphabet of the two strings
for (var i = 0; i < m; i++) {
charDictionary[s.charAt(i)] = 0;
}
for (var j = 0; j < n; j++) {
charDictionary[t.charAt(j)] = 0;
}
// Determine substring distances
for (var i = 1; i <= m; i++) {
var db = 0;
for (var j = 1; j <= n; j++) {
var i1 = charDictionary[t.charAt(j-1)];
var j1 = db;
var cost = 0;
if (s.charAt(i-1) == t.charAt(j-1)) { // Subtract one to start at strings' index zero instead of index one
db = j;
} else {
cost = 1;
}
d[i][j] = Math.min(d[i][j-1] + 1, // insertion
Math.min(d[i-1][j] + 1, // deletion
d[i-1][j-1] + cost)); // substitution
if(i1 > 0 && j1 > 0) {
d[i][j] = Math.min(d[i][j], d[i1-1][j1-1] + (i-i1-1) + (j-j1-1) + 1); //transposition
}
}
charDictionary[s.charAt(i-1)] = i;
}
// Return the strings' distance
return d[m][n];
}
alert(damerauLevenshteinDistance("Abxy", "aBxy"))
alert(damerauLevenshteinDistance("Abxy", "bAxy"))
Optimal String Alignment
has better performance
Optimal String Alignment Distance0.20-0.30ms
Damerau-Levenshtein Distance0.40-0.50ms
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With