Efficient data structure for word lookup with wildcards

I need to match a series of user-entered words against a large dictionary of words (to ensure each entered value exists).

So if the user entered:

"orange" it should match an entry "orange" in the dictionary.

Now the catch is that the user can also enter a wildcard or a series of wildcard characters, say:

"or__ge" which would also match "orange"

The key requirements are:

* this should be as fast as possible.

* use the smallest amount of memory to achieve it.

If the word list were small, I could use a string containing all the words and use regular expressions.

However, given that the word list could contain hundreds of thousands of entries, I'm assuming this wouldn't work.

So would some sort of 'tree' be the way to go for this?

Any thoughts or suggestions on this would be totally appreciated!

Thanks in advance, Matt

752

asked May 11 '10 23:05

Sway

2 Answers

Put your word list in a DAWG (directed acyclic word graph) as described in Appel and Jacobson's paper on the World's Fastest Scrabble Program (free copy at Columbia). For your search you will traverse this graph maintaining a set of pointers: on a letter, each pointer makes a deterministic transition to the child with that letter (and is dropped if there is no such child); on a wildcard, you add all children to the set.
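A minimal sketch of that traversal in Python, not the implementation from the paper. It assumes `_` is the wildcard and builds the DAWG the simple way: build a plain trie, then merge identical subtrees bottom-up. The names `Node`, `build_trie`, `minimize`, `matches` and `count_nodes` are made up for this illustration.

```python
class Node:
    """One graph node: outgoing edges by letter, plus an end-of-word flag."""
    __slots__ = ("children", "final")

    def __init__(self):
        self.children = {}
        self.final = False


def build_trie(words):
    """Build an ordinary (unshared) trie from an iterable of words."""
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.final = True
    return root


def minimize(node, registry):
    """Merge identical subtrees bottom-up so shared suffixes are stored once."""
    for ch, child in list(node.children.items()):
        node.children[ch] = minimize(child, registry)
    # Children are already canonical, so their identities describe the subtree.
    key = (node.final,
           tuple(sorted((ch, id(c)) for ch, c in node.children.items())))
    return registry.setdefault(key, node)


def matches(root, pattern, wildcard="_"):
    """Advance a set of node pointers one pattern character at a time."""
    frontier = {root}
    for ch in pattern:
        nxt = set()
        for node in frontier:
            if ch == wildcard:
                nxt.update(node.children.values())   # follow every edge
            elif ch in node.children:
                nxt.add(node.children[ch])           # follow the one edge
        if not nxt:
            return False                             # no pointer survived
        frontier = nxt
    return any(node.final for node in frontier)


def count_nodes(root):
    """Count distinct nodes reachable from root."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(node.children.values())
    return len(seen)


words = ["orange", "orangutan", "range", "grange"]
dawg = minimize(build_trie(words), {})
print(matches(dawg, "or__ge"), matches(dawg, "or__gx"))
print(count_nodes(build_trie(words)), count_nodes(dawg))
```

Because the frontier is a set, pointers that converge on a shared suffix node collapse into one entry, which is where the graph form helps over a plain trie during wildcard expansion.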

The efficiency will be roughly the same as Thompson's NFA interpretation for grep (they are the same algorithm). The DAWG structure is extremely space-efficient, far more so than just storing the words themselves, because shared prefixes and shared suffixes are each stored only once. And it is easy to implement.

Worst-case cost will be the size of the alphabet (26?) raised to the power of the number of wildcards. But unless your query begins with N wildcards, a simple left-to-right search will work well in practice. I'd suggest forbidding a query to begin with too many wildcards, or else creating multiple DAWGs, e.g., a DAWG for the mirror image, a DAWG for words rotated left three characters, and so on.

Matching an arbitrary sequence of wildcards, e.g., ______ is always going to be expensive because there are combinatorially many solutions. Even so, the DAWG will enumerate all solutions very quickly.

156

answered Oct 25 '22 06:10

Norman Ramsey

I would first test the regex solution and see whether it is fast enough - you might be surprised! :-)
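As a rough baseline for that test, here is a sketch in Python. It assumes `_` is the wildcard and that the dictionary is held as a list of words rather than one big string; `compile_query` and `regex_lookup` are hypothetical helper names.

```python
import re


def compile_query(pattern, wildcard="_"):
    """Translate a wildcard pattern into a compiled regex ('_' becomes '.')."""
    parts = ("." if ch == wildcard else re.escape(ch) for ch in pattern)
    return re.compile("".join(parts))


def regex_lookup(words, pattern):
    """Linear scan: return every word that matches the whole pattern."""
    rx = compile_query(pattern)
    return [w for w in words if rx.fullmatch(w)]


print(regex_lookup(["orange", "grange", "range"], "or__ge"))
```

Timing this against the real word list (for example with the standard `timeit` module) would show whether anything more elaborate is needed.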

However if that wasn't good enough I would probably use a prefix tree for this.

The basic structure is a tree where:

* The nodes at the top level are all the possible first letters (i.e. probably 26 nodes from a-z, assuming you are using a full dictionary).
* The next level down contains all the possible second letters for each given first letter.
* And so on, until you reach an "end of word" marker for each word.

Testing whether a given string with wildcards is contained in your dictionary is then just a simple recursive algorithm where you either have a direct match for each character position, or in the case of the wildcard you check each of the possible branches.

In the worst case (all wildcards, but only one word with the right number of letters, right at the end of the dictionary), you would traverse the entire tree, but this is still only O(n) in the size of the dictionary, so no worse than a full regex scan. In most cases it would take very few operations to either find a match or confirm that no such match exists, since large branches of the search tree are "pruned" with each successive letter.
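The recursive check described above could look something like this in Python. It is only a sketch: it assumes `_` is the wildcard, stores the tree as nested dicts, and uses the empty string as the "end of word" marker (safe because a single pattern character is never empty). `build_prefix_tree` and `contains` are illustrative names.

```python
END = ""  # end-of-word marker key; never equal to a one-character string


def build_prefix_tree(words):
    """Nested dicts: each key is a letter, each value the subtree below it."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def contains(node, pattern, i=0, wildcard="_"):
    """True if some word in the tree matches pattern[i:] starting at node."""
    if i == len(pattern):
        return END in node                  # must stop exactly on a word end
    ch = pattern[i]
    if ch == wildcard:
        # Try every branch; any() stops at the first one that succeeds.
        return any(contains(child, pattern, i + 1, wildcard)
                   for key, child in node.items() if key != END)
    child = node.get(ch)
    return child is not None and contains(child, pattern, i + 1, wildcard)


tree = build_prefix_tree(["orange", "oracle", "range"])
print(contains(tree, "or__ge"), contains(tree, "or___"))
```

A literal character follows at most one branch, which is the pruning mentioned above; only wildcards fan out.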

answered Oct 25 '22 06:10

mikera
