Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently match multiple regexes in Python

Lexical analyzers are quite easy to write when you have regexes. Today I wanted to write a simple general analyzer in Python, and came up with:

import re
import sys

class Token(object):
    """ A simple Token structure.
        Contains the token type, value and position. 
    """
    def __init__(self, type, val, pos):
        self.type = type
        self.val = val
        self.pos = pos

    def __str__(self):
        return '%s(%s) at %s' % (self.type, self.val, self.pos)


class LexerError(Exception):
    """ Lexer error exception.

        pos:
            Position in the input line where the error occurred.
    """
    def __init__(self, pos):
        self.pos = pos


class Lexer(object):
    """ A simple regex-based lexer/tokenizer.

        See below for an example of usage.
    """
    def __init__(self, rules, skip_whitespace=True):
        """ Create a lexer.

            rules:
                A list of rules. Each rule is a `regex, type`
                pair, where `regex` is the regular expression used
                to recognize the token and `type` is the type
                of the token to return when it's recognized.

            skip_whitespace:
                If True, whitespace (\s+) will be skipped and not
                reported by the lexer. Otherwise, you have to 
                specify your rules for whitespace, or it will be
                flagged as an error.
        """
        self.rules = []

        for regex, type in rules:
            self.rules.append((re.compile(regex), type))

        self.skip_whitespace = skip_whitespace
        self.re_ws_skip = re.compile('\S')

    def input(self, buf):
        """ Initialize the lexer with a buffer as input.
        """
        self.buf = buf
        self.pos = 0

    def token(self):
        """ Return the next token (a Token object) found in the 
            input buffer. None is returned if the end of the 
            buffer was reached. 
            In case of a lexing error (the current chunk of the
            buffer matches no rule), a LexerError is raised with
            the position of the error.
        """
        if self.pos >= len(self.buf):
            return None
        else:
            if self.skip_whitespace:
                m = self.re_ws_skip.search(self.buf[self.pos:])

                if m:
                    self.pos += m.start()
                else:
                    return None

            for token_regex, token_type in self.rules:
                m = token_regex.match(self.buf[self.pos:])

                if m:
                    value = self.buf[self.pos + m.start():self.pos + m.end()]
                    tok = Token(token_type, value, self.pos)
                    self.pos += m.end()
                    return tok

            # if we're here, no rule matched
            raise LexerError(self.pos)

    def tokens(self):
        """ Returns an iterator to the tokens found in the buffer.
        """
        while 1:
            tok = self.token()
            if tok is None: break
            yield tok


if __name__ == '__main__':
    rules = [
        ('\d+',             'NUMBER'),
        ('[a-zA-Z_]\w+',    'IDENTIFIER'),
        ('\+',              'PLUS'),
        ('\-',              'MINUS'),
        ('\*',              'MULTIPLY'),
        ('\/',              'DIVIDE'),
        ('\(',              'LP'),
        ('\)',              'RP'),
        ('=',               'EQUALS'),
    ]

    lx = Lexer(rules, skip_whitespace=True)
    lx.input('erw = _abc + 12*(R4-623902)  ')

    try:
        for tok in lx.tokens():
            print tok
    except LexerError, err:
        print 'LexerError at position', err.pos

It works just fine, but I'm a bit worried that it's too inefficient. Are there any regex tricks that will allow me to write it in a more efficient / elegant way ?

Specifically, is there a way to avoid looping over all the regex rules linearly to find one that fits?

like image 937
Eli Bendersky Avatar asked Sep 25 '08 15:09

Eli Bendersky


People also ask

How do you match a pattern in Python?

Introducing Python structural pattern matchingThe match/case statement follows the same basic outline as switch/case . It takes an object, tests the object against one or more match patterns, and takes an action if it finds a match. Each case statement is followed by a pattern to match against.

What does match () do in Python?

match() both are functions of re module in python. These functions are very efficient and fast for searching in strings. The function searches for some substring in a string and returns a match object if found, else it returns none.

What is r in regex Python?

The 'r' at the start of the pattern string designates a python "raw" string which passes through backslashes without change which is very handy for regular expressions (Java needs this feature badly!). I recommend that you always write pattern strings with the 'r' just as a habit.


1 Answers

I suggest using the re.Scanner class, it's not documented in the standard library, but it's well worth using. Here's an example:

import re

scanner = re.Scanner([
    (r"-?[0-9]+\.[0-9]+([eE]-?[0-9]+)?", lambda scanner, token: float(token)),
    (r"-?[0-9]+", lambda scanner, token: int(token)),
    (r" +", lambda scanner, token: None),
])

>>> scanner.scan("0 -1 4.5 7.8e3")[0]
[0, -1, 4.5, 7800.0]
like image 77
dan_waterworth Avatar answered Oct 16 '22 03:10

dan_waterworth