Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

capture a string and then match all other words that begins with that string

I have a list of 80,000+ words each separated by a newline. I need to match every word that contains, as its prefix, a smaller word. For example,

bald    <-- captures bald
balder  <-- matches because it starts with bald
balding <-- matches because it starts with bald
care    <-- captures care
cared   <-- matches because it starts with care
cares   <-- matches because it starts with care
caring  <-- does NOT match because it does not start with care

I'll be using find & replace in sublime text so I'd like to be able to user the replace all the matches with "" thus removing them from my list.

Okay, here is the backstory:

My list of words is basically an abridged version of the English dictionary. Using regex I've been able to remove all proper-nouns, abbreviations, words with accented characters, and all words less than 4 letters long. I'll be using this dictionary for a javascript word game I'm making. (Yes, this is for an assignment, but it's not for school credit and the assignment is simple to make a simple javascript game. My game logic works and I could edit the word list by hand, but I'd like it to be done before 2016 so regex seems to be the way to go).

The point of the game is to force your opponent to finish spelling a word. Players take turns adding letters to a string and as soon as the string matches a word in the dictionary gameplay comes to an end. For that reason, words like overblown, overhead, and overkill are dead weight. Once the over in overhead is spelled out, the game is...well...over.

I'll be loading the wordList as an array in a javascript file so I'd like it to be as small as possible.

I'm sure there are other ways of doing this (api's and such) but we can't use them for this assignment.

Any help would be very much appreciated!

like image 858
user7161 Avatar asked Jul 26 '15 05:07

user7161


People also ask

How to regex end of line?

End of String or Line: $ The $ anchor specifies that the preceding pattern must occur at the end of the input string, or before \n at the end of the input string. If you use $ with the RegexOptions. Multiline option, the match can also occur at the end of a line.

How do you search for a word in regex?

To run a “whole words only” search using a regular expression, simply place the word between two word boundaries, as we did with ‹ \bcat\b ›. The first ‹ \b › requires the ‹ c › to occur at the very start of the string, or after a nonword character.


2 Answers

An efficient structure to store lists of words is a prefix tree. For example, given a dictionary like

'car',
'card',
'carder',
'care',
'cared',
'cares',
'caring',
'can'

the trie might look like this

enter image description here

(where 0 denotes the end of word).

The code to build a trie is fairly simple:

function buildTree(words) {
    var tree = {};
    words.forEach(function (word) {
        var t = tree;
        [].forEach.call(word + "0", function (char) {
            t = t[char] || (t[char] = {});
        });
    });
    return tree;
}

Now, to enumerate all words that start with a given prefix, just walk the trie recursively and collect matching words:

function findWords(prefix, tree) {
    var found = [];

    function walk(pfx, t, word) {
        if (!pfx) {
            if (t[0])
                found.push(word)
            for (var c in t)
                walk("", t[c], word + c);
        } else if (pfx[0] in t)
            walk(pfx.substr(1), t[pfx[0]], word + pfx[0]);
    }

    walk(prefix, tree, "");
    return found;
}

Complete code:

function buildTree(words) {
    var tree = {};
    words.forEach(function (word) {
        var t = tree;
        [].forEach.call(word + "0", function (char) {
            t = t[char] || (t[char] = {});
        });
    });
    return tree;
}

function findWords(prefix, tree) {
    var found = [];

    function walk(pfx, t, word) {
        if (!pfx) {
            if (t[0])
                found.push(word)
            for (var c in t)
                walk("", t[c], word + c);
        } else if (pfx[0] in t)
            walk(pfx.substr(1), t[pfx[0]], word + pfx[0]);
    }

    walk(prefix, tree, "");
    return found;
}

words = [
    'car',
    'card',
    'carder',
    'care',
    'cared',
    'cares',
    'caring',
    'can'

]

prefixTree = buildTree(words);
document.write(findWords("care", prefixTree));

To remove words that start with another word, you can build the trie as above and then walk it, cutting the search once a terminal marker (0) is found:

function buildTree(words) {
    var tree = {};
    words.forEach(function (word) {
        var t = tree;
        [].forEach.call(word + "0", function (char) {
            t = t[char] || (t[char] = {});
        });
    });
    return tree;
}


function findShortWords(tree) {
    var found = [];

    function walk(t, word) {
        if(t[0]) {
            found.push(word);
            return;
          }
        for (var c in t)
            walk(t[c], word + c);
    }

    walk(tree, "");
    return found;
}

words = [
    'card',
    'carder',
    'care',
    'cared',
    'cares',
    'caring',
    'can',
    'canoe',
    'bald',
    'balder',
    'balding',
    'foo'

]

prefixTree = buildTree(words);

document.write(findShortWords(prefixTree));
like image 176
georg Avatar answered Sep 19 '22 10:09

georg


If you can reverse-sort your word list first then there is a RegExp that might help you. So far I have only tested it in JavaScript-replace on my iPod but I believe Sublime Text should be able to do the same. Here is my sample text (with a few more added words to spice it up...):

var s="runs,running,runner,run,forging,forgetting,forgets,forget,forged,forge,caring,cares,cared,care,balding,balder,bald"

I will shortly transform that into a list of words separated by newlines and then apply my RegExp:

s=s.replace(/,/g,'\n');
s.replace(/(\w+)\w+\n(?=(?:.*\n)*\1(?:\n|$))/g,'');

This will remove all "unwanted" lines completely and get you

run
forge
caring
care
bald

If, however, you still want empty lines in place of the unwanted words you could do

s.replace(/(\w+)\w+(?=\n(?:.*\n)*\1(?:\n|$))/g,'');

The reverse word order is essential to my approach, as I am using a look-ahead of variable length to look for the base form of the word while I am potentially capturing an inflected form of it in the main pattern (\w+)\w+. Interestingly I never believed it would work like this, but it does! The capturing and non-capturing portions of the line are not clearly defined but the RegExp engine tries different lengths until the capturing group is found in the look-ahead pattern \n(?:.*\n)*\1(?:\n|$). In my JavaScript sample it was enough to set the g flag of the RegExp. In Sublime text you will most likely also have to set the multi-line flag equivalent of m.

There is still one fundamental flaw in the whole approach as you might have noticed already: the stem "forge" did not only eradicate its own other verb forms but also took away all forms of "forget". This is a "logical" problem linked to the fact that a base verb form is never clearly defined in a plain list of words.

update

Just read your question again. For the purpose of your game the "fundamental flaw" is actually something you want. So just ignore my last paragraph. ;-)

like image 29
Carsten Massmann Avatar answered Sep 18 '22 10:09

Carsten Massmann