Regular expression implementation details

Tags:

regex

A question that I answered got me wondering:

How are regular expressions implemented in Python? What sort of efficiency guarantees are there? Is the implementation "standard", or is it subject to change?

I thought that regular expressions would be implemented as DFAs, and therefore were very efficient (requiring at most one scan of the input string). Laurence Gonsalves raised an interesting point that not all Python regular expressions are regular. (His example is r"(a+)b\1", which matches some number of a's, a b, and then the same number of a's as before). This clearly cannot be implemented with a DFA.
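The backreference example can be tried directly with the standard `re` module. This is only a minimal sketch: the helper name `matches_whole` is made up for illustration, and `fullmatch` is used so that the whole string has to fit the pattern.

```python
import re

# Pattern from the question: a run of a's, a single b, then the same run again.
pattern = re.compile(r"(a+)b\1")

def matches_whole(text):
    """Return True if the entire string matches (a+)b\\1."""
    return pattern.fullmatch(text) is not None

print(matches_whole("aaabaaa"))  # True: three a's on each side of the b
print(matches_whole("aaabaa"))   # False: the two runs differ in length
```

With `re.search` instead of `fullmatch`, the second string would still yield a match on the substring "aabaa", because the search may start at any position.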

So, to reiterate: what are the implementation details and guarantees of Python regular expressions?

It would also be nice if someone could give some sort of explanation (in light of the implementation) as to why the regular expressions "cat|catdog" and "catdog|cat" lead to different search results in the string "catdog", as mentioned in the question that I referenced before.

asked May 09 '09 22:05

Tom

2 Answers

Python's re module was originally based on PCRE, but has since moved to its own implementation.

Here is the link to the C code.

The library appears to be based on recursive backtracking: when a path through the pattern turns out to be wrong, the matcher backs up and tries the next alternative.
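Backtracking is also consistent with the "cat|catdog" versus "catdog|cat" behaviour from the question: alternatives are tried from left to right, and the first one that lets the overall match succeed is accepted, even if a later alternative would match more text. A small sketch (the helper `first_match` is a name chosen here for illustration):

```python
import re

def first_match(pattern, text):
    """Return the text of the first match found by re.search, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

text = "catdog"
print(first_match(r"cat|catdog", text))  # cat
print(first_match(r"catdog|cat", text))  # catdog
```

A DFA-based engine with leftmost-longest semantics would report "catdog" for both patterns, so the difference is a property of the ordered, backtracking approach.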

[Figure: matching time as a function of regular expression and text size n, for a?ⁿaⁿ matching aⁿ]

Keep in mind that this graph is not representative of normal regex searches.

http://swtch.com/~rsc/regexp/regexp1.html
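The pathological case in the graph can be reproduced with a rough timing sketch. The function name `pathological_time` and the range of n are choices made here, not taken from the article, and the actual timings depend on the machine and the Python version.

```python
import re
import time

def pathological_time(n):
    """Time matching the pattern a?^n a^n against the string a^n."""
    pattern = re.compile("a?" * n + "a" * n)
    text = "a" * n
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    elapsed = time.perf_counter() - start
    return matched, elapsed

# Keep n small: on a backtracking engine the time is expected to grow
# roughly exponentially with n.
for n in range(10, 19, 2):
    matched, elapsed = pathological_time(n)
    print(f"n={n:2d} matched={matched} seconds={elapsed:.4f}")
```

The match always succeeds (every a? can match the empty string), but a backtracking matcher may explore a very large number of ways to split the a's between the optional and the mandatory part before it finds that one.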

answered Oct 08 '22 07:10

Unknown

There are no "efficiency guarantees" on Python REs any more than on any other part of the language (C++'s standard library is the only widespread language standard I know of that tries to establish such guarantees -- but there are no standards, even in C++, specifying that, say, multiplying two ints must take constant time, or anything like that); nor is there any guarantee that big optimizations won't be applied at any time.

Today, F. Lundh (originally responsible for implementing Python's current RE module, etc.), presenting Unladen Swallow at PyCon Italia, mentioned that one of the avenues they'll be exploring is to compile regular expressions directly to LLVM intermediate code (rather than to their own bytecode flavor, interpreted by an ad-hoc runtime) -- since ordinary Python code is also getting compiled to LLVM (in a soon-forthcoming release of Unladen Swallow), an RE and its surrounding Python code could then be optimized together, sometimes even in quite aggressive ways. I doubt anything like that will be anywhere close to "production-ready" very soon, though ;-).

answered Oct 08 '22 07:10

Alex Martelli
