Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Combining (OR) arbitrary regular expressions

tl;dr Is there a way to OR/combine arbitrary regexes into a single regex (for matching, not capturing) in Java?


In my application I receive two lists from the user:

  1. list of regular expressions
  2. list of strings

and I need to output a list of the strings in (2) that were not matched by any of the regular expressions in (1).

I have the obvious naive implementation in place (iterate over all strings in (2); for each string iterate over all patterns in (1); if no pattern match the string add it to the list that will be returned) but I was wondering if it was possible to combine all patterns into a single one and let the regex compiler exploit optimization opportunities.

The obvious way to OR-combine regexes is obviously (regex1)|(regex2)|(regex3)|...|(regexN) but I'm pretty sure this is not the correct thing to do considering that I have no control over the individual regexes (e.g. they could contain all manners of back/forward references). I was therefore wondering if you can suggest a better way to combine arbitrary regexes in java.


note: it's only implied by the above, but I'll make it explicit: I'm only matching against the string - I don't need to use the output of the capturing groups.

like image 472
CAFxX Avatar asked Dec 15 '12 07:12

CAFxX


People also ask

What is a regular expression?

On an abstract level a regular expression, regex for short, is a shorthand representation for a set. A set of strings. Say we have a list of all valid z i p codes.

How does combining work in regex?

Combining works if the intended result of combination is that any of them matching results in the whole regexp matching. Show activity on this post. If a string must not contain @, every character must be another character than @: This will match any string of any length that does not contain @.

What is the use of compiled regex?

If a Regex object is constructed with the RegexOptions.Compiled option, it compiles the regular expression to explicit MSIL code instead of high-level regular expression internal instructions. This allows .NET's just-in-time (JIT) compiler to convert the expression to native machine code for higher performance.

How do I Group and alternate terms in regular expressions?

Grouping and alternation are core features of every modern regular expression library. You can provide as many terms as desired, as long as they are separated with the pipe character: |. This character separates terms contained within each (...) group. Take the following example, for instance: I like dogs, but not lions.


1 Answers

Some regex engines (e.g. PCRE) have the construct (?|...). It's like a non-capturing group, but has the nice feature that in every alternation groups are counted from the same initial value. This would probably immediately solve your problem. So if switching the language for this task is an option for you, that should do the trick.

[edit: In fact, it will still cause problems with clashing named capturing groups. In fact, the pattern won't even compile, since group names cannot be reused.]

Otherwise you will have to manipulate the input patterns. hyde suggested renumbering the backreferences, but I think there is a simpler option: making all groups named groups. You can assure yourself that the names are unique.

So basically, for every input pattern you create a unique identifier (e.g. increment an ID). Then the trickiest part is finding capturing groups in the pattern. You won't be able to do this with a regex. You will have to parse the pattern yourself. Here are some thoughts on what to look out for if you are simply iterating through the pattern string:

  • Take note when you enter and leave a character class, because inside character classes parentheses are literal characters.
  • Maybe the trickiest part: ignore all opening parentheses that are followed by ?:, ?=, ?!, ?<=, ?<!, ?>. In addition there are the option setting parentheses: (?idmsuxU-idmsuxU) or (?idmsux-idmsux:somePatternHere) which also capture nothing (of course there could be any subset of those options and they could be in any order - the - is also optional).
  • Now you should be left only with opening parentheses that are either a normal capturing group or a named on: (?<name>. The easiest thing might be to treat them all the same - that is, having both a number and a name (where the name equals the number if it was not set). Then you rewrite all of those with something like (?<uniqueIdentifier-md5hashOfName> (the hyphen cannot be actually part of the name, you will just have your incremented number followed by the hash - since the hash is of fixed length there won't be any duplicates; pretty much at least). Make sure to remember which number and name the group originally had.
  • Whenever you encounter a backslash there are three options:
    1. The next character is a number. You have a numbered backreference. Replace all those numbers with k<name> where name is the new group name you generated for the group.
    2. The next characters are k<...>. Again replace this with the corresponding new name.
    3. The next character is anything else. Skip it. That handles escaping of parentheses and escaping of backslashes at the same time.
  • I think Java might allow forward references. In that case you need two passes. Take care of renaming all groups first. Then change all the references.

Once you have done this on every input pattern, you can safely combine all of them with |. Any other feature than backreferences should not cause problems with this approach. At least not as long as your patterns are valid. Of course, if you have inputs a(b and c)d then you have a problem. But you will have that always if you don't check that the patterns can be compiled on their own.

I hope this gave you a pointer in the right direction.

like image 140
Martin Ender Avatar answered Oct 02 '22 00:10

Martin Ender