Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient data structure for word lookup with wildcards

Tags:

I need to match a series of user inputed words against a large dictionary of words (to ensure the entered value exists).

So if the user entered:

"orange" it should match an entry "orange' in the dictionary.

Now the catch is that the user can also enter a wildcard or series of wildcard characters like say

"or__ge" which would also match "orange"

The key requirements are:

* this should be as fast as possible.

* use the smallest amount of memory to achieve it.  

If the size of the word list was small I could use a string containing all the words and use regular expressions.

however given that the word list could contain potentially hundreds of thousands of enteries I'm assuming this wouldn't work.

So is some sort of 'tree' be the way to go for this...?

Any thoughts or suggestions on this would be totally appreciated!

Thanks in advance, Matt

like image 752
Sway Avatar asked May 11 '10 23:05

Sway


People also ask

What are the wildcards used for pattern matching?

Use the asterisk ( * ) to match any sequence or string of characters. The ( * ) indicates any characters, including no characters.

What is the pattern matching the wildcard in unix?

The patterns are called file globs. The name "glob" comes from the name of the original UNIX program that expanded the pattern into a set of matching filenames. A string is a wildcard pattern if it contains one of the characters '?' , '*' or '['.

What wildcard character is used to search for values that contain a search term in a list?

You can use the asterisk (*) anywhere in a character string.

What is a wildcard in word processing?

Alternatively referred to as a wild character or wildcard character, a wildcard is a symbol used to replace or represent one or more characters. The most common wildcards are the asterisk (*), which represents one or more characters and question mark (?) that represents a single character.


2 Answers

Put your word list in a DAWG (directed acyclic word graph) as described in Appel and Jacobsen's paper on the World's Fastest Scrabble Program (free copy at Columbia). For your search you will traverse this graph maintaining a set of pointers: on a letter, you make a deterministic transition to children with that letter; on a wildcard, you add all children to the set.

The efficiency will be roughly the same as Thompson's NFA interpretation for grep (they are the same algorithm). The DAWG structure is extremely space-efficient—far more so than just storing the words themselves. And it is easy to implement.

Worst-case cost will be the size of the alphabet (26?) raised to the power of the number of wildcards. But unless your query begins with N wildcards, a simple left-to-right search will work well in practice. I'd suggest forbidding a query to begin with too many wildcards, or else create multiple dawgs, e.g., dawg for mirror image, dawg for rotated left three characters, and so on.

Matching an arbitrary sequence of wildcards, e.g., ______ is always going to be expensive because there are combinatorially many solutions. The dawg will enumerate all solutions very quickly.

like image 156
Norman Ramsey Avatar answered Oct 25 '22 06:10

Norman Ramsey


I would first test the regex solution and see whether it is fast enough - you might be surprised! :-)

However if that wasn't good enough I would probably use a prefix tree for this.

The basic structure is a tree where:

  • The nodes at the top level are all the possible first letters (i.e. probably 26 nodes from a-z assuming you are using a full dictionary...).
  • The next level down contains all the possible second letters for each given first letter
  • And so on until you reach an "end of word" marker for each word

Testing whether a given string with wildcards is contained in your dictionary is then just a simple recursive algorithm where you either have a direct match for each character position, or in the case of the wildcard you check each of the possible branches.

In the worst case (all wildcards but only one word with the right number of letters right at the end of the dictionary), you would traverse the entire tree but this is still only O(n) in the size of the dictionary so no worse than a full regex scan. In most cases it would take very few operations to either find a match or confirm that no such match exists since large branches of the search tree are "pruned" with each successive letter.

like image 34
mikera Avatar answered Oct 25 '22 06:10

mikera