Given a predefined set of phrases, I'd like to perform a search based on the user's query. For example, consider the following set of phrases:
index   phrase
-----------------------------------------
0       Stack Overflow
1       Math Overflow
2       Super User
3       Webmasters
4       Electrical Engineering
5       Programming Jokes
6       Programming Puzzles
7       Geographic Information Systems
The expected behaviour is:
query     result
------------------------------------------------------------------------
s         Stack Overflow, Super User, Geographic Information Systems
web       Webmasters
over      Stack Overflow, Math Overflow
super u   Super User
user s    Super User
e e       Electrical Engineering
p         Programming Jokes, Programming Puzzles
p p       Programming Puzzles
To implement this behaviour I used a trie. Every node in the trie has an array of indices (empty initially).
To insert a phrase into the trie, I first break it into words. For example, Programming Puzzles has index = 6. Therefore, I add 6 to all the following nodes:
p
pr
pro
prog
progr
progra
program
programm
programmi
programmin
programming
pu
puz
puzz
puzzl
puzzle
puzzles
The problem is, when I search for the query prog p, I first get a list of indices for prog, which is [5, 6]. Then, I get a list of indices for p, which is [5, 6] as well. Finally, I calculate the intersection between the two, and return the result [5, 6], which is obviously wrong (should be [6]).
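The behaviour described above can be reproduced with a minimal sketch. It assumes a plain object keyed by prefix as a stand-in for the trie; the names buildPrefixIndex and naiveSearch are hypothetical and not part of the original code:

```javascript
// Hypothetical stand-in for the trie: maps every word prefix to the
// indices of the phrases that contain a word starting with that prefix.
function buildPrefixIndex(phrases) {
    var index = {};
    phrases.forEach(function (phrase, i) {
        phrase.trim().toLowerCase().split(/\s+/).forEach(function (word) {
            for (var len = 1; len <= word.length; len++) {
                var prefix = word.substring(0, len);
                if (!index.hasOwnProperty(prefix)) {
                    index[prefix] = [];
                }
                // Store each phrase index at most once per prefix.
                if (index[prefix].indexOf(i) === -1) {
                    index[prefix].push(i);
                }
            }
        });
    });
    return index;
}

// Naive search: intersect the index lists of all query words.
function naiveSearch(index, query) {
    var lists = query.trim().toLowerCase().split(/\s+/).map(function (word) {
        return index.hasOwnProperty(word) ? index[word] : [];
    });
    return lists.reduce(function (acc, list) {
        return acc.filter(function (i) { return list.indexOf(i) !== -1; });
    });
}

var phrases = [
    'Stack Overflow', 'Math Overflow', 'Super User', 'Webmasters',
    'Electrical Engineering', 'Programming Jokes', 'Programming Puzzles',
    'Geographic Information Systems'
];
var index = buildPrefixIndex(phrases);
// Prints the indices 5 and 6, although only phrase 6 has two words starting with "p".
console.log(naiveSearch(index, 'prog p'));
```

The intersection cannot tell that both query words were satisfied by the same phrase word, which is exactly the problem.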
How would you fix this?
We can use the fact that two words in a query can match the same word in a phrase only if one query word is a prefix of the other (or if they are equal). So if we process the query words in descending lexicographic order (prefixes come after their "superwords"), we can safely remove a word from a phrase at its first match. This leaves no possibility of matching the same phrase word twice. It is safe because a prefix matches a superset of the phrase words that its "superwords" can match, while two query words, neither of which is a prefix of the other, always match disjoint sets of phrase words.
We don't have to remove words from phrases or the trie "physically", we can do it "virtually".
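To make the ordering concrete, here is a small sketch of the sorting step only. The helper names orderQueryWords and prefixesComeLast are hypothetical; the sketch relies on the default Array.prototype.sort ordering, which compares strings by UTF-16 code units:

```javascript
// Sort query words in descending lexicographic order, so that every word
// is processed before any of its own proper prefixes.
function orderQueryWords(query) {
    return query.trim().toLowerCase().split(/\s+/).sort().reverse();
}

// Returns false if some word appears earlier than a longer word
// that starts with it, i.e. if a prefix precedes its "superword".
function prefixesComeLast(words) {
    for (var i = 0; i < words.length; i++) {
        for (var j = i + 1; j < words.length; j++) {
            if (words[j] !== words[i] && words[j].indexOf(words[i]) === 0) {
                return false;
            }
        }
    }
    return true;
}

var ordered = orderQueryWords('p prog pu');
console.log(ordered);                   // the words in the order pu, prog, p
console.log(prefixesComeLast(ordered)); // true
```

A proper prefix always sorts before the longer word in ascending order, so reversing the sorted list is enough to put it after all of its "superwords".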
var PhraseSearch = function () {
    // Each trie node stores, for every phrase, how many of the phrase's
    // words start with the prefix that leads to this node.
    var Trie = function () {
        this.phraseWordCount = {};
        this.children = {};
    };

    Trie.prototype.addPhraseWord = function (phrase, word) {
        if (word !== '') {
            var first = word.charAt(0);
            if (!this.children.hasOwnProperty(first)) {
                this.children[first] = new Trie();
            }
            var rest = word.substring(1);
            this.children[first].addPhraseWord(phrase, rest);
        }
        // Count this word for the phrase at every node along its path,
        // including the root.
        if (!this.phraseWordCount.hasOwnProperty(phrase)) {
            this.phraseWordCount[phrase] = 0;
        }
        this.phraseWordCount[phrase]++;
    };

    Trie.prototype.getPhraseWordCount = function (prefix) {
        if (prefix !== '') {
            var first = prefix.charAt(0);
            if (this.children.hasOwnProperty(first)) {
                var rest = prefix.substring(1);
                return this.children[first].getPhraseWordCount(rest);
            } else {
                return {};
            }
        } else {
            return this.phraseWordCount;
        }
    };

    this.trie = new Trie();
};

PhraseSearch.prototype.addPhrase = function (phrase) {
    var words = phrase.trim().toLowerCase().split(/\s+/);
    words.forEach(function (word) {
        this.trie.addPhraseWord(phrase, word);
    }, this);
};

PhraseSearch.prototype.search = function (query) {
    // Start with all phrases (the root node counts every phrase).
    var answer = {};
    var phraseWordCount = this.trie.getPhraseWordCount('');
    for (var phrase in phraseWordCount) {
        if (phraseWordCount.hasOwnProperty(phrase)) {
            answer[phrase] = true;
        }
    }

    // Process query words in descending lexicographic order.
    var prefixes = query.trim().toLowerCase().split(/\s+/);
    prefixes.sort();
    prefixes.reverse();

    var prevPrefix = '';
    var superprefixCount = 0;
    prefixes.forEach(function (prefix) {
        // Reset the counter when the current word is not a prefix
        // of the previously processed one.
        if (prevPrefix.indexOf(prefix) !== 0) {
            superprefixCount = 0;
        }
        phraseWordCount = this.trie.getPhraseWordCount(prefix);

        function phraseMatchedWordCount(phrase) {
            return phraseWordCount.hasOwnProperty(phrase) ? phraseWordCount[phrase] - superprefixCount : 0;
        }

        // Drop phrases that have no unused word left for this query word.
        for (var phrase in answer) {
            if (answer.hasOwnProperty(phrase) && phraseMatchedWordCount(phrase) < 1) {
                delete answer[phrase];
            }
        }
        prevPrefix = prefix;
        superprefixCount++;
    }, this);

    return Object.keys(answer);
};

function test() {
    var phraseSearch = new PhraseSearch();
    var phrases = [
        'Stack Overflow',
        'Math Overflow',
        'Super User',
        'Webmasters',
        'Electrical Engineering',
        'Programming Jokes',
        'Programming Puzzles',
        'Geographic Information Systems'
    ];
    phrases.forEach(phraseSearch.addPhrase, phraseSearch);

    // Maps each query to the expected result.
    var queries = {
        's': 'Stack Overflow, Super User, Geographic Information Systems',
        'web': 'Webmasters',
        'over': 'Stack Overflow, Math Overflow',
        'super u': 'Super User',
        'user s': 'Super User',
        'e e': 'Electrical Engineering',
        'p': 'Programming Jokes, Programming Puzzles',
        'p p': 'Programming Puzzles'
    };
    for (var query in queries) {
        if (queries.hasOwnProperty(query)) {
            var expected = queries[query];
            var actual = phraseSearch.search(query).join(', ');
            console.log('query: ' + query);
            console.log('expected: ' + expected);
            console.log('actual: ' + actual);
        }
    }
}

test();

One can test this code here: http://ideone.com/RJgj6p
Storing the phrase word count in each trie node is not very memory efficient. But by implementing a compressed trie, it is possible to reduce the worst-case memory complexity to O(n m), where n is the number of different words in all the phrases, and m is the total number of phrases.
For simplicity I initialize answer by adding all the phrases. But a more time-efficient approach is to initialize answer by adding the phrases matched by the query word that matches the least number of phrases. Then intersect with the phrases of the query word matching the second least number of phrases. And so on...
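The smallest-first intersection could look roughly like the sketch below. It is not part of the code above: the function intersectSmallestFirst and its input (one array of matched phrases per query word, already computed with the superprefixCount adjustment) are assumptions made for illustration.

```javascript
// matchedSets: hypothetical input, one array of matched phrase names
// per query word. Returns the phrases present in every array.
function intersectSmallestFirst(matchedSets) {
    if (matchedSets.length === 0) {
        return [];
    }
    // Visit the sets from smallest to largest (copy first, do not mutate input).
    var sorted = matchedSets.slice().sort(function (a, b) {
        return a.length - b.length;
    });
    var answer = sorted[0].slice();
    // Stop early once the answer is empty: nothing can be added back.
    for (var i = 1; i < sorted.length && answer.length > 0; i++) {
        var lookup = {};
        sorted[i].forEach(function (phrase) {
            lookup[phrase] = true;
        });
        answer = answer.filter(function (phrase) {
            return lookup.hasOwnProperty(phrase);
        });
    }
    return answer;
}

// Matches for the query "user s": "s" matches three phrases, "user" one.
var result = intersectSmallestFirst([
    ['Stack Overflow', 'Super User', 'Geographic Information Systems'],
    ['Super User']
]);
console.log(result.join(', ')); // Super User
```

Starting from the smallest set keeps the working answer as short as possible, so each later intersection only filters a handful of candidates.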
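For each query word, the number of words in each phrase that start with it is read from the trie. From that count we subtract the number of previously processed query words that have the current word as a prefix (superprefixCount), and a phrase is considered matched by the current query word only when the resulting number of matched words in it is greater than zero. As in the original implementation, the final result is the intersection of the matched phrases. As one can see, the changes are minimal and the asymptotic complexities (both time and memory) are unchanged.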
If the set of phrases is fixed and does not contain long phrases, maybe you can create not one trie but n tries, where n is the maximum number of words in one phrase.
In the i-th trie, store the i-th word of each phrase. Let's call it the trie with label 'i'.
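Building the per-position tries might look like the following sketch. It covers only the construction and a prefix lookup; the node layout and the names makeNode, insertWord, buildPositionalTries and lookupPrefix are assumptions for illustration, and positions are 0-based here.

```javascript
// Hypothetical minimal trie node: children by character, plus the indices
// of the phrases whose word at this position starts with the node's prefix.
function makeNode() {
    return { children: {}, phrases: [] };
}

function insertWord(root, word, phraseIndex) {
    var node = root;
    for (var k = 0; k < word.length; k++) {
        var ch = word.charAt(k);
        if (!node.children.hasOwnProperty(ch)) {
            node.children[ch] = makeNode();
        }
        node = node.children[ch];
        node.phrases.push(phraseIndex);
    }
}

// tries[i] holds the word at position i of every phrase that has one.
function buildPositionalTries(phrases) {
    var tries = [];
    phrases.forEach(function (phrase, phraseIndex) {
        phrase.trim().toLowerCase().split(/\s+/).forEach(function (word, i) {
            if (!tries[i]) {
                tries[i] = makeNode();
            }
            insertWord(tries[i], word, phraseIndex);
        });
    });
    return tries;
}

// Indices of the phrases whose word in this trie starts with the prefix.
function lookupPrefix(trie, prefix) {
    var node = trie;
    for (var k = 0; k < prefix.length; k++) {
        node = node.children[prefix.charAt(k)];
        if (!node) {
            return [];
        }
    }
    return node.phrases;
}

var tries = buildPositionalTries([
    'Stack Overflow', 'Math Overflow', 'Super User', 'Webmasters',
    'Electrical Engineering', 'Programming Jokes', 'Programming Puzzles',
    'Geographic Information Systems'
]);
console.log(tries.length); // 3, the longest phrase has three words
```

Because each phrase contributes at most one word to each trie, a phrase index can appear at most once per node, so a match in trie i identifies exactly which word of the phrase was used.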
To process query with m words let's consider the following algorithm:
This is not optimal. To improve performance, you should store only the usable values of the d array. After the first word, store only the phrases matched by this word. Also, instead of the assignment d[j] = n + 1, delete index j. Process only the phrase indexes that are already stored.