
Searching for multiple partial phrases so that one original phrase can not match multiple search phrases

Given a predefined set of phrases, I'd like to perform a search based on the user's query. For example, consider the following set of phrases:

index      phrase
-----------------------------------------
0          Stack Overflow
1          Math Overflow
2          Super User
3          Webmasters
4          Electrical Engineering
5          Programming Jokes
6          Programming Puzzles
7          Geographic Information Systems 

The expected behaviour is:

query         result
------------------------------------------------------------------------
s             Stack Overflow, Super User, Geographic Information Systems
web           Webmasters
over          Stack Overflow, Math Overflow
super u       Super User
user s        Super User
e e           Electrical Engineering
p             Programming Jokes, Programming Puzzles
p p           Programming Puzzles

To implement this behaviour I used a trie. Every node in the trie has an array of indices (empty initially).

To insert a phrase into the trie, I first break it into words. For example, Programming Puzzles has index = 6. Therefore, I add 6 to all the following nodes:

p
pr
pro
prog
progr
progra
program
programm
programmi
programmin
programming
pu
puz
puzz
puzzl
puzzle
puzzles

The problem is, when I search for the query prog p, I first get a list of indices for prog which is [5, 6]. Then, I get a list of indices for p which is [5, 6] as well. Finally, I calculate the intersection between the two, and return the result [5, 6], which is obviously wrong (should be [6]).
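For reference, here is a minimal sketch that reproduces the problem. It is not the original code: buildPrefixIndex and naiveSearch are hypothetical names, and a flat prefix map stands in for the trie nodes.

```javascript
// Hypothetical stand-in for the trie: maps every word prefix to the set of
// phrase indices that contain a word starting with that prefix.
function buildPrefixIndex(phrases) {
    var index = {};
    phrases.forEach(function (phrase, i) {
        phrase.trim().toLowerCase().split(/\s+/).forEach(function (word) {
            for (var len = 1; len <= word.length; len++) {
                var prefix = word.substring(0, len);
                if (!index.hasOwnProperty(prefix)) {
                    index[prefix] = {};
                }
                index[prefix][i] = true;
            }
        });
    });
    return index;
}

// Intersects the index sets of all query words. Nothing stops two query
// words from matching the same phrase word, which is the bug described above.
function naiveSearch(index, query) {
    var result = null;
    query.trim().toLowerCase().split(/\s+/).forEach(function (word) {
        var matched = index.hasOwnProperty(word) ? index[word] : {};
        if (result === null) {
            result = Object.keys(matched).map(Number);
        } else {
            result = result.filter(function (i) {
                return matched.hasOwnProperty(i);
            });
        }
    });
    return result;
}

var index = buildPrefixIndex([
    'Stack Overflow', 'Math Overflow', 'Super User', 'Webmasters',
    'Electrical Engineering', 'Programming Jokes', 'Programming Puzzles',
    'Geographic Information Systems'
]);
console.log(naiveSearch(index, 'prog p')); // both "Programming ..." phrases come back
```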

How would you fix this?

Misha Moroshko asked May 08 '15 12:05


2 Answers

Key Observation

We can use the fact that two words in a query can match the same word in a phrase only if one query word is a prefix of the other (or if they are the same). So if we process the query words in descending lexicographic order (prefixes come after their "superwords"), we can safely remove a word from the phrase at its first match. This leaves no possibility of matching the same phrase word twice. It is safe because a prefix matches a superset of the phrase words that its "superwords" can match, and a pair of query words where neither is a prefix of the other always matches disjoint sets of phrase words.

We don't have to remove words from phrases or the trie "physically", we can do it "virtually".
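To make the observation concrete, here is a small sketch that checks a single phrase against a query by removing words "physically". The function name phraseMatches is hypothetical and is not part of the implementation below; it only illustrates the greedy idea.

```javascript
// Checks one phrase against a query by processing the query words in
// descending lexicographic order and removing the first phrase word that
// each query word matches.
function phraseMatches(phrase, query) {
    var remaining = phrase.trim().toLowerCase().split(/\s+/);
    var queryWords = query.trim().toLowerCase().split(/\s+/).sort().reverse();

    return queryWords.every(function (queryWord) {
        for (var i = 0; i < remaining.length; i++) {
            if (remaining[i].indexOf(queryWord) === 0) {
                remaining.splice(i, 1); // this phrase word can no longer be matched
                return true;
            }
        }
        return false; // no unused phrase word starts with this query word
    });
}

console.log(phraseMatches('Programming Puzzles', 'prog p'));
console.log(phraseMatches('Programming Jokes', 'prog p'));
```

For 'prog p' the sorted order is prog, p, so prog consumes programming first and p then needs a second word starting with p.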

Implementation of the Algorithm

var PhraseSearch = function () {   
    var Trie = function () {
        this.phraseWordCount = {};
        this.children = {};
    };

    Trie.prototype.addPhraseWord = function (phrase, word) {
        if (word !== '') {
            var first = word.charAt(0);

            if (!this.children.hasOwnProperty(first)) {
                this.children[first] = new Trie();
            }
            var rest = word.substring(1);
            this.children[first].addPhraseWord(phrase, rest);
        }
        if (!this.phraseWordCount.hasOwnProperty(phrase)) {
            this.phraseWordCount[phrase] = 0;
        }
        this.phraseWordCount[phrase]++;
    };

    Trie.prototype.getPhraseWordCount = function (prefix) {
        if (prefix !== '') {
            var first = prefix.charAt(0);

            if (this.children.hasOwnProperty(first)) {
                var rest = prefix.substring(1);
                return this.children[first].getPhraseWordCount(rest);
            } else {
                return {};
            }
        } else {
            return this.phraseWordCount;
        }
    };

    this.trie = new Trie();
};

PhraseSearch.prototype.addPhrase = function (phrase) {
    var words = phrase.trim().toLowerCase().split(/\s+/);
    words.forEach(function (word) {
        this.trie.addPhraseWord(phrase, word);
    }, this);
};

PhraseSearch.prototype.search = function (query) {
    var answer = {};
    var phraseWordCount = this.trie.getPhraseWordCount('');
    for (var phrase in phraseWordCount) {
        if (phraseWordCount.hasOwnProperty(phrase)) {
            answer[phrase] = true;
        }
    }

    var prefixes = query.trim().toLowerCase().split(/\s+/);

    prefixes.sort();
    prefixes.reverse();

    prefixes.forEach(function (prefix, i) {
        // Count the query words already processed that start with this prefix
        // (its "superwords", duplicates included). After the descending sort
        // they all precede it in the array. Tracking only the previous word
        // would undercount for queries such as "ab aa a".
        var superprefixCount = 0;
        for (var k = 0; k < i; k++) {
            if (prefixes[k].indexOf(prefix) === 0) {
                superprefixCount++;
            }
        }
        phraseWordCount = this.trie.getPhraseWordCount(prefix);

        function phraseMatchedWordCount(phrase) {
            return phraseWordCount.hasOwnProperty(phrase) ? phraseWordCount[phrase] - superprefixCount : 0;
        }

        for (var phrase in answer) {
            if (answer.hasOwnProperty(phrase) && phraseMatchedWordCount(phrase) < 1) {
                delete answer[phrase];
            }
        }
    }, this);

    return Object.keys(answer);
};

function test() {
    var phraseSearch = new PhraseSearch();

    var phrases = [
        'Stack Overflow',
        'Math Overflow',
        'Super User',
        'Webmasters',
        'Electrical Engineering',
        'Programming Jokes',
        'Programming Puzzles',
        'Geographic Information Systems'
    ];

    phrases.forEach(phraseSearch.addPhrase, phraseSearch);

    var queries = {
        's':       'Stack Overflow, Super User, Geographic Information Systems',
        'web':     'Webmasters',
        'over':    'Stack Overflow, Math Overflow',
        'super u': 'Super User',
        'user s':  'Super User',
        'e e':     'Electrical Engineering',
        'p':       'Programming Jokes, Programming Puzzles',
        'p p':     'Programming Puzzles'
    };

    for(var query in queries) {
        if (queries.hasOwnProperty(query)) {
            var expected = queries[query];
            var actual = phraseSearch.search(query).join(', ');

            console.log('query: ' + query);
            console.log('expected: ' + expected);
            console.log('actual: ' + actual);
        }
    }
}

test();

One can test this code here: http://ideone.com/RJgj6p

Possible Optimizations

  • Storing the phrase word count in each trie node is not very memory efficient. But by implementing a compressed trie it is possible to reduce the worst-case memory complexity to O(n m), where n is the number of different words in all the phrases, and m is the total number of phrases.

  • For simplicity I initialize answer by adding all the phrases. A more time-efficient approach is to initialize answer with the phrases matched by the query word that matches the fewest phrases, then intersect with the phrases of the query word that matches the second fewest, and so on.
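The second optimization could look roughly like the sketch below. The function intersectSmallestFirst is a hypothetical helper, and each input object is assumed to hold the phrases that already pass the match test for one query word.

```javascript
// matchSets: one object per query word, whose keys are the phrases
// considered matched by that word (assumed to be computed elsewhere).
function intersectSmallestFirst(matchSets) {
    if (matchSets.length === 0) {
        return [];
    }

    // Start from the smallest set so that every later pass filters as few
    // candidates as possible.
    var ordered = matchSets.slice().sort(function (a, b) {
        return Object.keys(a).length - Object.keys(b).length;
    });

    var answer = Object.keys(ordered[0]);
    for (var i = 1; i < ordered.length && answer.length > 0; i++) {
        var current = ordered[i];
        answer = answer.filter(function (phrase) {
            return current.hasOwnProperty(phrase);
        });
    }
    return answer;
}

console.log(intersectSmallestFirst([
    { 'Stack Overflow': true, 'Super User': true, 'Geographic Information Systems': true },
    { 'Super User': true }
]));
```

The loop also stops early once answer becomes empty, since no further intersection can add phrases back.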

Relevant Differences from the Implementation Referenced in the Question

  1. In each trie node I store not only the phrase references (ids) matched by the subtrie, but also the number of matched words in each of those phrases. So the result of a match is both the matched phrase references and the number of matched words in each of them.
  2. I process query words in descending lexicographic order.
  3. I subtract the number of superprefixes (query words of which the current query word is a prefix) from current match results (by using variable superprefixCount), and a phrase is considered matched by the current query word only when the resulting number of matched words in it is greater than zero. As in the original implementation, the final result is the intersection of the matched phrases.

As one can see, changes are minimal and asymptotic complexities (both time and memory) are not changed.

dened answered Nov 07 '22 21:11


If the set of phrases is fixed and does not contain long phrases, you could create not one trie but n tries, where n is the maximum number of words in a phrase.

In the i-th trie, store the i-th word of each phrase. Let's call it the trie with label i.

To process a query with m words, consider the following algorithm:

  1. For each phrase, store the label of the trie in which a word from this phrase was last found. Denote it d[j], where j is the phrase index. Initially d[j] = -1 for every phrase j.
  2. Search for the first query word in each of the n tries.
  3. For each phrase j, find the label of a trie that is greater than d[j] and in which a word from this phrase was found. If there are several such labels, pick the smallest one. Denote this label c[j].
  4. If there is no such label, the phrase cannot be matched. You can mark this case with d[j] = n + 1.
  5. If there is such a c[j] with c[j] > d[j], then assign d[j] = c[j].
  6. Repeat steps 2 to 5 for every remaining query word.
  7. Every phrase with -1 < d[j] < n is matched.

This is not optimal. To improve performance, store only the useful values of the d array: after the first word, keep only the phrases matched by that word. Also, instead of assigning d[j] = n + 1, delete index j, and process only the phrase indexes that are still stored.
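The steps above can be sketched as follows. This is one reading of the algorithm, not a tested implementation from the answer: buildPositionIndexes and searchByPosition are hypothetical names, and a flat prefix map per word position stands in for each labelled trie.

```javascript
// indexes[i] plays the role of the trie with label i: it maps every prefix
// of an i-th word to the set of phrase indices having that word.
function buildPositionIndexes(phrases) {
    var indexes = [];
    phrases.forEach(function (phrase, j) {
        phrase.trim().toLowerCase().split(/\s+/).forEach(function (word, i) {
            if (!indexes[i]) {
                indexes[i] = {};
            }
            for (var len = 1; len <= word.length; len++) {
                var prefix = word.substring(0, len);
                if (!indexes[i].hasOwnProperty(prefix)) {
                    indexes[i][prefix] = {};
                }
                indexes[i][prefix][j] = true;
            }
        });
    });
    return indexes;
}

function searchByPosition(indexes, phraseCount, query) {
    var d = {}; // d[j]: label matched by the previous query word, -1 at first
    for (var j = 0; j < phraseCount; j++) {
        d[j] = -1;
    }
    query.trim().toLowerCase().split(/\s+/).forEach(function (word) {
        Object.keys(d).forEach(function (j) {
            var c = -1; // smallest label greater than d[j] that matches
            for (var label = d[j] + 1; label < indexes.length; label++) {
                var byPrefix = indexes[label];
                if (byPrefix.hasOwnProperty(word) && byPrefix[word].hasOwnProperty(j)) {
                    c = label;
                    break;
                }
            }
            if (c === -1) {
                delete d[j]; // the phrase cannot match; drop it instead of marking n + 1
            } else {
                d[j] = c;
            }
        });
    });
    return Object.keys(d).map(Number);
}

var phrases = [
    'Stack Overflow', 'Math Overflow', 'Super User', 'Webmasters',
    'Electrical Engineering', 'Programming Jokes', 'Programming Puzzles',
    'Geographic Information Systems'
];
var indexes = buildPositionIndexes(phrases);
console.log(searchByPosition(indexes, phrases.length, 'p p'));
```

Because each label must be greater than the previous one, this reading matches query words in phrase order. In this sketch the query user s therefore finds nothing, while super u finds Super User.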

Fefer_Ivan answered Nov 07 '22 21:11