Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Combining multiple regular expressions into one automaton

Let's say that I have a list of regular expressions (read from an external source - file, database etc). I want to check which of these regular expressions a string matches.

I can create iterate through all these regular expression and match them, but the list could be a huge one and this is a critical operation.

I can combine all these regular expressions into one (with | between them), but then the problem is that I can identify only the first matched regular expression, not all.

Another idea could be to create an automaton for all these regular expression and to mark the final states with, let's say, the indexes of the corresponding regular expression. I was looking at http://cs.au.dk/~amoeller/automaton/, a library which seems capable to work with regular expressions and automaton, but not sure it can be extended to solve my problem.

Do you have any other ideas?

To clarify some comments, I've added a code sample:

import java.util.regex.Matcher;
import java.util.regex.Pattern;


public class PatternTest {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("(a(?:b|c)a)|((?:a|b)ba)|(ab(?:a|c))");     
        Matcher m = p.matcher("aba");
        System.out.println(m.matches());
        System.out.println(m.groupCount());
        for (int i = 0, n = m.groupCount(); i < n; i++) {
            System.out.println(m.group(i));
        }
    }
}

will print out

true
3
aba
aba
null

As you can see only the first group is matched and I don't see a way to match the other two ones.

More findings - Using the automaton library above, the problem will reduce to the following: if you concatenate two or more automatons, how you can identify for a final state to which of the original automatons is corresponding?

like image 966
Adrian Ber Avatar asked Mar 08 '13 15:03

Adrian Ber


2 Answers

dk.brics.automaton does not directly support this, but you can generalize the representation of automata (and the relevant automata operations) to distinguish between different kinds of accept states. Start by adding an int field, for example, to the State class and use this field whenever 'accept' is set.

like image 133
Anders Møller Avatar answered Oct 23 '22 00:10

Anders Møller


I implemented such a solution based on dk.brics.automaton, you can find it here. https://github.com/fulmicoton/multiregexp

like image 27
fulmicoton Avatar answered Oct 22 '22 23:10

fulmicoton