Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm of the greatest intersect of word in set of words

The story behind

I am creating a voice controlled application using x-webkit-speech which is surprisingly good (the feature, not my app), but sometimes the user (me) mumbles a bit. It would be nice to accept the command if some reasonable part of the word matches some reasonable part of some reasonable command. So I search for the holy grail called Algorithm of the Greatest Intersect of Word in Set of Words. Could some fresh bright mind drive me out of the cave of despair?

Example

"rotation" in ["notable","tattoo","onclick","statistically"]

should match tattoo because it has the longest intersect with rotation (tat_o). statistically is the second best (tati intersect), because longer part of the word needs to be ignored (but this is bonus condition, it would be acceptable without it).

Notes

  • I use Czech language where the pronunciation is very close to its written form
  • javascript is the preffered language, but any pseudocode is acceptable
  • the minimal length of the intersect should be a parameter of the algorithm

What have I tried?

Well, it is pretty embarassing....

for(var i=10; i>=4; --i) // reasonable substring
for(var word in words) // for all words in the set
for(var j=0; j<word.length-i; ++j) // search for any i substring
// aaargh... three levels of abstraction is too much for me
like image 762
Jan Turoň Avatar asked Oct 21 '22 08:10

Jan Turoň


1 Answers

This is an algorithm that seems to work. I have no idea how good it performs compared to other already established algorithms (I suspect it perform worse) but maybe it gives you an idea how you could do it:

FIDDLE

var minInt = 3;
var arr = ["notable","tattoo","onclick","statistically"];
var word = "rotation";

var res = [];
if (word.length >= minInt) {
    for (var i = 0; i < arr.length; i++) {
        var comp = arr[i];
        var m = 0;
        if (comp.length >= minInt) {
            for (var l = 0; l < comp.length - minInt + word.length - minInt + 1; l++) {
                var subcomp = l > word.length - minInt ? comp.substring(l - word.length + minInt) : comp;
                var subword = l < word.length - minInt ? word.substring(word.length - minInt - l) : word;
                var minL = Math.min(subcomp.length, subword.length);
                var matches = 0;
                for (var k = 0; k < minL; k++) {
                    if (subcomp[k] === subword[k]) {
                        matches++;
                    }
                }
                if (matches > m) {
                    m = matches;
                }
            }
        }
        res[i] = m >= minInt ? m : null;
    }
}

console.log(res);

What happens is, that it compares the two strings by "moving" on against the other and calculates the matching letters in each position. Here you see the compared "sub"words for rotation vs. notable:

ion / notable --> one match on index 1
tion / notable --> no match
ation / notable --> no match
tation / notable --> one match on index 2
otation / notable --> no match
rotation / notable --> three matches on index 1,2,3
rotation / otable --> no match
rotation / table --> no match
rotation / able --> no match
rotation / ble  --> no match

As you see, the maximum number of matches is 3 and that is what it would return.

like image 69
basilikum Avatar answered Nov 11 '22 08:11

basilikum