Writing a parser for regular expressions

Even after years of programming, I'm ashamed to say that I've never fully grasped regular expressions. When a problem calls for a regex, I can usually come up with an appropriate one (after a lot of referring to the syntax), but it's a technique that I find myself using increasingly often.

So, to teach myself and understand regular expressions properly, I've decided to do what I always do when trying to learn something: write something ambitious that I'll probably abandon as soon as I feel I've learnt enough.

To this end, I want to write a regular expression parser in Python. In this case, "learn enough" means that I want to implement a parser that can understand Perl's extended regex syntax completely. However, it doesn't have to be the most efficient parser or even necessarily usable in the real world. It merely has to correctly match or fail to match a pattern in a string.

The question is, where do I start? I know almost nothing about how regexes are parsed and interpreted apart from the fact that it involves a finite state automaton in some way. Any suggestions for how to approach this rather daunting problem would be much appreciated.

EDIT: I should clarify that while I'm going to implement the regex parser in Python, I'm not overly fussed about what programming language the examples or articles are written in. As long as it's not in Brainfuck, I will probably understand enough of it to make it worth my while.

Chinmay Kanchi asked Sep 03 '10 21:09



1 Answer

Writing a regular expression engine is indeed quite a complex task.

But if you are interested in how to do it, even if you can't understand enough of the details to actually implement it, I would recommend that you at least look at this article:

Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)

It explains that many popular programming languages implement regular expressions with backtracking, which can be very slow (exponential time in the worst case) for some regular expressions, and it describes a different method, based on simulating a finite automaton, that avoids this blow-up. The article includes some details of how the proposed implementation works, including some source code in C. It may be a bit heavy reading if you are just starting to learn regular expressions, but I think it is well worth knowing about the difference between the two approaches.
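To give a feel for the automaton-based approach, here is a minimal sketch in Python. It is not the article's C code, and it handles only a tiny subset of regex syntax (literals, `|`, `*`, `+`, `?` and parentheses, with whole-string matching), nowhere near Perl's extended syntax. The names `State`, `parse`, `closure` and `match` are my own, chosen for illustration. The idea is to parse the pattern into an NFA and then track the set of all states the NFA could be in, instead of trying one path at a time and backtracking.

```python
class State:
    """One NFA state. `char` is the literal it consumes, or None for an epsilon state."""
    def __init__(self, char=None):
        self.char = char
        self.out = []  # successor states


def parse(pattern):
    """Recursive-descent parse of a small regex subset into an NFA (start, accept)."""
    pos = 0

    def peek():
        return pattern[pos] if pos < len(pattern) else None

    def alternation():  # concatenation ('|' concatenation)*
        nonlocal pos
        start, end = concatenation()
        while peek() == '|':
            pos += 1
            start2, end2 = concatenation()
            split, join = State(), State()
            split.out = [start, start2]
            end.out.append(join)
            end2.out.append(join)
            start, end = split, join
        return start, end

    def concatenation():  # repetition*
        start = end = State()  # empty fragment; pieces are chained onto it
        while peek() is not None and peek() not in '|)':
            piece_start, piece_end = repetition()
            end.out.append(piece_start)
            end = piece_end
        return start, end

    def repetition():  # atom ('*' | '+' | '?')*
        nonlocal pos
        start, end = atom()
        while peek() in ('*', '+', '?'):
            op = pattern[pos]
            pos += 1
            split, join = State(), State()
            split.out.append(start)
            end.out.append(join)
            if op in '*?':
                split.out.append(join)  # the piece may be skipped
            if op in '*+':
                end.out.append(start)   # the piece may repeat
            start, end = split, join
        return start, end

    def atom():  # literal | '(' alternation ')'
        nonlocal pos
        c = peek()
        if c == '(':
            pos += 1
            fragment = alternation()
            if peek() != ')':
                raise ValueError("missing ')'")
            pos += 1
            return fragment
        if c is None or c in '*+?|)':
            raise ValueError(f"unexpected {c!r} at position {pos}")
        pos += 1
        literal, end = State(c), State()
        literal.out.append(end)
        return literal, end

    start, accept = alternation()
    if pos != len(pattern):
        raise ValueError(f"unexpected {pattern[pos]!r} at position {pos}")
    return start, accept


def closure(states):
    """Return `states` plus everything reachable from them via epsilon edges."""
    stack, seen = list(states), set(states)
    while stack:
        state = stack.pop()
        if state.char is None:
            for target in state.out:
                if target not in seen:
                    seen.add(target)
                    stack.append(target)
    return seen


def match(pattern, text):
    """True if `pattern` matches the whole of `text`."""
    start, accept = parse(pattern)
    current = closure([start])
    for ch in text:
        # Advance every live state that consumes `ch`, all at once.
        current = closure([t for s in current if s.char == ch for t in s.out])
        if not current:
            return False
    return accept in current
```

For example, `match("a(b|c)*d", "abcbd")` returns `True` and `match("ab+c", "ac")` returns `False`. Because the set of live states can never be larger than the NFA itself, a pattern such as `"a?" * 25 + "a" * 25` matched against `"a" * 25` finishes immediately here, whereas that kind of pattern is the classic slow case for a backtracking matcher.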

Mark Byers answered Oct 22 '22 03:10