Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to match a string against a set of wildcard strings efficiently?

Tags:

python

I am looking for a solution to match a single string against a set of wildcard strings. For example

>>> match("ab", ["a*", "b*", "*", "c", "*b"])
["a*", "*", "*b"]

The order of the output is of no importance.

I will have in the order of 10^4 wildcard strings to match against and I will do around ~10^9 match calls. This means I will probably have to rewrite my code like so:

>>> matcher = prepare(["a*", "b*", "*", "c", "*b"]
>>> for line in lines: yield matcher.match("ab")
["a*", "*", "*b"]

I've started writing a trie implementation in Python that handles wildcards and I just need to get those corner cases right. Despite this I am curious to hear; How would you solve this? Are there any Python libraries out there that make me solve this faster?

Some insights so far:

  • Named (Python, re) regular expressions will not help me here since they'll only return one match.
  • pyparsing seems like an awesome library, but is sparsely documented and does not, as I see it, support matching multiple patterns.
like image 800
Ztyx Avatar asked Oct 30 '25 09:10

Ztyx


2 Answers

You could use FilteredRE2 class from re2 library with a help from Aho-Corasick algorithm implementation (or similar). From re2 docs:

Required substrings. Suppose you have an efficient way to check which of a list of strings appear as substrings in a large text (for example, maybe you implemented the Aho-Corasick algorithm), but now your users want to be able to do regular expression searches efficiently too. Regular expressions often have large literal strings in them; if those could be identified, they could be fed into the string searcher, and then the results of the string searcher could be used to filter the set of regular expression searches that are necessary. The FilteredRE2 class implements this analysis. Given a list of regular expressions, it walks the regular expressions to compute a boolean expression involving literal strings and then returns the list of strings. For example, FilteredRE2 converts (hello|hi)world[a-z]+foo into the boolean expression “(helloworld OR hiworld) AND foo” and returns those three strings. Given multiple regular expressions, FilteredRE2 converts each into a boolean expression and returns all the strings involved. Then, after being told which of the strings are present, FilteredRE2 can evaluate each expression to identify the set of regular expressions that could possibly be present. This filtering can reduce the number of actual regular expression searches significantly.

The feasibility of these analyses depends crucially on the simplicity of their input. The first uses the DFA form, while the second uses the parsed regular expression (Regexp*). These kind of analyses would be more complicated (maybe even impossible) if RE2 allowed non-regular features in its regular expressions.

like image 192
jfs Avatar answered Nov 02 '25 01:11

jfs


Seems like Aho-Corasick algorithm would work. esmre seem to do what I'm looking for. I got this information from this question.

like image 39
Ztyx Avatar answered Nov 02 '25 01:11

Ztyx