Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hashtable/dictionary/map lookup with regular expressions

Tags:

I'm trying to figure out if there's a reasonably efficient way to perform a lookup in a dictionary (or a hash, or a map, or whatever your favorite language calls it) where the keys are regular expressions and strings are looked up against the set of keys. For example (in Python syntax):

>>> regex_dict = { re.compile(r'foo.') : 12, re.compile(r'^FileN.*$') : 35 } >>> regex_dict['food'] 12 >>> regex_dict['foot in my mouth'] 12 >>> regex_dict['FileNotFoundException: file.x does not exist'] 35 

(Obviously the above example won't work as written in Python, but that's the sort of thing I'd like to be able to do.)

I can think of a naive way to implement this, in which I iterate over all of the keys in the dictionary and try to match the passed in string against them, but then I lose the O(1) lookup time of a hash map and instead have O(n), where n is the number of keys in my dictionary. This is potentially a big deal, as I expect this dictionary to grow very large, and I will need to search it over and over again (actually I'll need to iterate over it for every line I read in a text file, and the files can be hundreds of megabytes in size).

Is there a way to accomplish this, without resorting to O(n) efficiency?

Alternatively, if you know of a way to accomplish this sort of a lookup in a database, that would be great, too.

(Any programming language is fine -- I'm using Python, but I'm more interested in the data structures and algorithms here.)

Someone pointed out that more than one match is possible, and that's absolutely correct. Ideally in this situation I'd like to return a list or tuple containing all of the matches. I'd settle for the first match, though.

I can't see O(1) being possible in that scenario; I'd settle for anything less than O(n), though. Also, the underlying data structure could be anything, but the basic behavior I'd like is what I've written above: lookup a string, and return the value(s) that match the regular expression keys.

like image 305
Jeff Avatar asked Nov 03 '08 21:11

Jeff


2 Answers

What you want to do is very similar to what is supported by xrdb. They only support a fairly minimal notion of globbing however.

Internally you can implement a larger family of regular languages than theirs by storing your regular expressions as a character trie.

  • single characters just become trie nodes.
  • .'s become wildcard insertions covering all children of the current trie node.
  • *'s become back links in the trie to node at the start of the previous item.
  • [a-z] ranges insert the same subsequent child nodes repeatedly under each of the characters in the range. With care, while inserts/updates may be somewhat expensive the search can be linear in the size of the string. With some placeholder stuff the common combinatorial explosion cases can be kept under control.
  • (foo)|(bar) nodes become multiple insertions

This doesn't handle regexes that occur at arbitrary points in the string, but that can be modeled by wrapping your regex with .* on either side.

Perl has a couple of Text::Trie -like modules you can raid for ideas. (Heck I think I even wrote one of them way back when)

like image 172
Edward Kmett Avatar answered Jan 01 '23 22:01

Edward Kmett


This is not possible to do with a regular hash table in any language. You'll either have to iterate through the entire keyset, attempting to match the key to your regex, or use a different data structure.

You should choose a data structure that is appropriate to the problem you're trying to solve. If you have to match against any arbitrary regular expression, I don't know of a good solution. If the class of regular expressions you'll be using is more restrictive, you might be able to use a data structure such as a trie or suffix tree.

like image 36
Adam Rosenfield Avatar answered Jan 01 '23 21:01

Adam Rosenfield