Efficient algorithm for converting a character set into a nfa/dfa

I'm currently working on a scanner generator. The generator already works, but it becomes very slow when character classes are used.

The scanner generator produces a scanner for UTF8 encoded files. The full range of characters (0x000000 to 0x10ffff) should be supported.

If I use large character sets, like the any operator '.' or the Unicode property {L}, the NFA (and also the DFA) contains a lot of states (> 10000). So converting the NFA to a DFA and then minimizing the DFA takes a long time, even if the resulting minimal DFA contains only a few states.

Here's my current implementation of the function that creates the NFA fragment for a character set.

void CreateNfaPart(int startStateIndex, int endStateIndex, Set<int> characters)
{
    transitions[startStateIndex] = CreateEmptyTransitionsArray();
    foreach (int character in characters) {
        // get the utf8 encoded bytes for the character
        byte[] encoded = EncodingHelper.EncodeCharacter(character);
        int tStartStateIndex = startStateIndex;
        // follow (or create) one state per byte, except for the last byte
        for (int i = 0; i < encoded.Length - 1; i++) {
            int tEndStateIndex = transitions[tStartStateIndex][encoded[i]];
            if (tEndStateIndex == -1) {
                tEndStateIndex = CreateState();
                transitions[tEndStateIndex] = CreateEmptyTransitionsArray();
            }
            transitions[tStartStateIndex][encoded[i]] = tEndStateIndex;
            tStartStateIndex = tEndStateIndex;
        }
        // the last byte leads to the shared end state
        transitions[tStartStateIndex][encoded[encoded.Length - 1]] = endStateIndex;
    }
}

Does anyone know how to implement the function much more efficiently to create only the necessary states?

EDIT:

To be more specific I need a function like:

List<Set<byte>[]> Convert(Set<int> characters)
{
     ???????
}

A helper function to convert a character (int) to a UTF8 encoding byte[] is defined as:

byte[] EncodeCharacter(int character)
{ ... }
raisyn asked Aug 21 '10 19:08


2 Answers

There are a number of ways to handle it. They all boil down to handling whole sets of characters at a time in the data structures, rather than ever enumerating the entire alphabet. That is also how you build scanners for Unicode in a reasonable amount of memory.

You have many choices about how to represent and process sets of characters. I'm presently working with a solution that keeps, for each state, an ordered list of boundary conditions and corresponding target states. You can process operations on these lists much faster than you could if you had to scan the entire alphabet at each juncture. In fact, it's fast enough that it runs in Python with acceptable speed.
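As a rough illustration of the "ordered list of boundaries plus target states" idea, here is a minimal Python sketch. The class name `RangeTransitions` and its methods are made up for this example and are not taken from any particular scanner generator. Each state stores a sorted list of interval start points and a parallel list of targets, so a lookup is a binary search and a 100,000-character class costs only a couple of list entries.

```python
from bisect import bisect_right

MAX_CODEPOINT = 0x10FFFF


class RangeTransitions:
    """Outgoing transitions of one state, stored as sorted range boundaries.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        # starts[i] is the first code point of interval i. Interval i ends
        # just before starts[i + 1]; the last one ends at MAX_CODEPOINT.
        self.starts = [0]
        self.targets = [None]  # None means "no transition"

    def _split(self, cp):
        """Make sure an interval begins exactly at cp and return its index."""
        i = bisect_right(self.starts, cp) - 1
        if self.starts[i] == cp:
            return i
        self.starts.insert(i + 1, cp)
        self.targets.insert(i + 1, self.targets[i])
        return i + 1

    def set_range(self, lo, hi, target):
        """Send every code point in [lo, hi] (inclusive) to target."""
        i = self._split(lo)
        j = self._split(hi + 1) if hi < MAX_CODEPOINT else len(self.starts)
        # Collapse all intervals inside [lo, hi] into a single one.
        self.starts[i:j] = [lo]
        self.targets[i:j] = [target]

    def lookup(self, cp):
        """Return the target state for cp, or None if there is none."""
        return self.targets[bisect_right(self.starts, cp) - 1]
```

For example, after `t = RangeTransitions()` and `t.set_range(0x80, 0x10FFFF, 2)`, the state holds two intervals instead of more than a million table entries, and `t.lookup(0x1F600)` finds the target with one binary search.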

Ian answered Oct 12 '22 22:10


I'll clarify what I think you're asking for: to union a set of Unicode codepoints such that you produce a state-minimal DFA where transitions represent UTF8-encoded sequences for those codepoints.

When you say "more efficiently", that could apply to runtime, memory usage, or to compactness of the end result. The usual meaning for "minimal" in finite automata refers to using the fewest states to describe any given language, which is what you're getting at by "create only the necessary states".

Every finite automaton has exactly one equivalent state-minimal DFA, up to renaming of states (see the Myhill-Nerode theorem [1], or Hopcroft & Ullman [2]). For your purposes, we can construct this minimal DFA directly using the Aho-Corasick algorithm [3].

To do this, we need a mapping from Unicode codepoints to their corresponding UTF8 encodings. There's no need to store all of these UTF8 byte sequences in advance; they can be encoded on the fly. The UTF8 encoding algorithm is well documented and I won't repeat it here.
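For readers who want the encoding step spelled out anyway, here is a small Python sketch of on-the-fly UTF-8 encoding. The function name `utf8_encode` matches the pseudocode in this answer, but the body is only an illustration of the standard bit layout; in real Python code `chr(cp).encode("utf-8")` does the same job.

```python
def utf8_encode(cp):
    """Encode one Unicode code point as UTF-8 bytes (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        # Surrogates and out-of-range values have no valid UTF-8 encoding.
        raise ValueError("not a Unicode scalar value: %r" % (cp,))
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

Because the length of the encoding is determined by the first byte, no encoded sequence is a prefix of another, which is what lets a trie of these sequences accept only at its leaves.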

Aho-Corasick works by first constructing a trie. In your case this would be a trie of each UTF8 sequence added in turn. Then that trie is annotated with transitions turning it into a DAG per the rest of the algorithm. There's a nice overview of the algorithm here, but I suggest reading the paper itself.

Pseudocode for this approach:

trie = empty
foreach codepoint in input_set:
   bytes[] = utf8_encode(codepoint)
   trie_add_key(bytes)
dfa = add_failure_edges(trie) # per the rest of AC
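To make the trie step concrete, here is a runnable Python sketch. Note the swap: instead of the Aho-Corasick failure-edge pass named in the pseudocode, it merges identical subtrees bottom-up, which is a simpler way to shrink a trie of a finite set of strings into a minimal acyclic DFA. The function names are invented for this example, and the input is assumed to contain no surrogate code points (Python refuses to encode those).

```python
def build_trie(codepoints):
    """Build a trie of UTF-8 byte sequences.

    Each node is a dict mapping a byte value to a child node. A node with
    no children marks the end of a complete sequence (an accepting state).
    """
    root = {}
    for cp in codepoints:
        node = root
        for byte in chr(cp).encode("utf-8"):
            node = node.setdefault(byte, {})
    return root


def count_trie_nodes(node):
    """Number of nodes in the trie, which is the state count before merging."""
    return 1 + sum(count_trie_nodes(child) for child in node.values())


def minimize(node, registry):
    """Return a state id for node, reusing ids for identical subtrees.

    registry maps a state's outgoing transitions, as a sorted tuple of
    (byte, target_id) pairs, to its id. After the call it doubles as the
    transition table of the merged automaton.
    """
    key = tuple(sorted((byte, minimize(child, registry))
                       for byte, child in node.items()))
    if key not in registry:
        registry[key] = len(registry)
    return registry[key]
```

For the two-byte range U+0080..U+07FF, `build_trie(range(0x80, 0x800))` has 1951 nodes, while `minimize` leaves 3 states: the start state, one shared continuation-byte state, and the accepting state. The trie itself still needs one insertion per code point, so this fixes the size of the result rather than the cost of building it.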

This approach (forming a trie of UTF8-encoded sequences, then Aho-Corasick, then rendering out DFA) is the approach taken in the implementation for my regexp and finite state machine libraries, where I do exactly this for constructing Unicode character classes. Here you can see code for:

  • UTF8-encoding a Unicode codepoint: examples/utf8dfa/main.c

  • Construction of the trie: libre/ac.c

  • Rendering out of minimal DFA for each character class: libre/class/

Other approaches (as mentioned in other answers to this question) include working on codepoints and expressing ranges of codepoints, rather than spelling out every byte sequence.
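As a sketch of that range-based alternative, the following Python function splits one code point range into a short list of byte-range sequences, which is roughly the shape of the `Convert` function asked for in the question. It follows the well-known splitting scheme used by several regex engines: cut at the surrogate gap, cut at encoded-length boundaries, then cut until every continuation byte covers a full sub-range. The name `utf8_sequences` and the list-of-tuples output format are choices made for this example.

```python
def utf8_sequences(lo, hi):
    """Yield byte-range sequences that together match exactly [lo, hi].

    Each yielded item is a list of (low_byte, high_byte) pairs, one pair per
    byte of the encoding. Surrogates (U+D800..U+DFFF) are skipped.
    """
    # 1. Remove the surrogate gap, which has no UTF-8 encoding.
    if lo <= 0xDFFF and hi >= 0xD800:
        if lo < 0xD800:
            yield from utf8_sequences(lo, 0xD7FF)
        if hi > 0xDFFF:
            yield from utf8_sequences(0xE000, hi)
        return
    # 2. Split where the encoded length changes (1, 2, 3 or 4 bytes).
    for max_cp in (0x7F, 0x7FF, 0xFFFF):
        if lo <= max_cp < hi:
            yield from utf8_sequences(lo, max_cp)
            yield from utf8_sequences(max_cp + 1, hi)
            return
    if hi <= 0x7F:
        yield [(lo, hi)]
        return
    # 3. Split until the trailing 6-bit groups each span a full rectangle.
    for i in range(1, 4):
        mask = (1 << (6 * i)) - 1
        if (lo & ~mask) != (hi & ~mask):
            if (lo & mask) != 0:
                yield from utf8_sequences(lo, lo | mask)
                yield from utf8_sequences((lo | mask) + 1, hi)
                return
            if (hi & mask) != mask:
                yield from utf8_sequences(lo, (hi & ~mask) - 1)
                yield from utf8_sequences(hi & ~mask, hi)
                return
    # 4. Both ends now encode to the same length; pair the bytes up.
    first = chr(lo).encode("utf-8")
    last = chr(hi).encode("utf-8")
    yield list(zip(first, last))
```

Calling `list(utf8_sequences(0, 0x10FFFF))` produces 9 sequences for the whole of Unicode, starting with `[(0x00, 0x7F)]` and `[(0xC2, 0xDF), (0x80, 0xBF)]`. Each sequence maps to a short chain of states with range-labelled transitions, so the automaton never needs a state per code point.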

[1] Myhill-Nerode: Nerode, Anil (1958), Linear Automaton Transformations, Proceedings of the AMS, 9, JSTOR 2033204
[2] Hopcroft & Ullman (1979), Section 3.4, Theorem 3.10, p.67
[3] Aho, Alfred V.; Corasick, Margaret J. (June 1975). Efficient string matching: An aid to bibliographic search. Communications of the ACM. 18 (6): 333–340.

Kate F answered Oct 12 '22 20:10