Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is regex search slower with capturing groups in Python?

Tags:

I have an application code which generates regexes dynamically from a config for some parsing. When timing performance of two variations, the regex variation with each part of an OR regex being captured is noticably slow than a normal regex. The reason would be overhead of certain operations internally in regex module.

>>> import timeit
>>> setup = '''
... import re
... '''   

#no capture group 
>>> print(timeit.timeit("re.search(r'hello|bye|ola|cheers','some say hello,some say bye, or ola or cheers!')", setup=setup))
0.922958850861

#with capture group
>>> print(timeit.timeit("re.search(r'(hello)|(bye)|(ola)|(cheers)','some say hello,some say bye, or ola or cheers!')", setup=setup))
1.44321084023

#no capture group
>>> print(timeit.timeit("re.search(r'hello|bye|ola|cheers','some say hello,some say bye, or ola or cheers!')", setup=setup))
0.913202047348

# capture group
>>> print(timeit.timeit("re.search(r'(hello)|(bye)|(ola)|(cheers)','some say hello,some say bye, or ola or cheers!')", setup=setup))
1.41544604301

Question: What causes this considerable drop in performance when using capture groups ?

like image 920
DhruvPathak Avatar asked Jan 03 '17 13:01

DhruvPathak


People also ask

Why is Python regex slow?

Basically, the issue is with the doubled repetition of the a in the pattern. The part a* allows for any number of a 's, while the surrounding (·)* also allows any number of the contained pattern. This allows for a huge number of possible ways to match the pattern against a string of a 's.

What is the purpose of regex capture groups?

Capturing groups are a way to treat multiple characters as a single unit. They are created by placing the characters to be grouped inside a set of parentheses. For example, the regular expression (dog) creates a single group containing the letters "d" "o" and "g" .

How do you speed up re search in Python?

One thing you might want to try is pre-processing the sentences to encode the word boundaries. Basically turn each sentence into a list of words by splitting on word boundaries. This should be faster, because to process a sentence, you just have to step through each of the words and check if it's a match.

How do I make regular expressions faster?

Expose Literal Characters Regex engines match fastest when anchors and literal characters are right there in the main pattern, rather than buried in sub-expressions. Hence the advice to "expose" literal characters whenever you can take them out of an alternation or quantified expression. Let's look at two examples.


2 Answers

The reason is pretty simple, using capturing groups indicate the Engine to save the content in memory, while using non capturing group indicates the engine to not save anything. Consider that you are telling the engine to perform more operations.

For instance, using this regex (hello|bye|ola|cheers) or (hello)|(bye)|(ola)|(cheers) will impact considerably higher than using an atomic group or a non capturing one like (?:hello|bye|ola|cheers).

When using regex you know if you want to capture or not capture content like the case above. If you want to capture any of those words, you will lose performance but if you don't need to capture content then you can save performance by improving it like using non-capturing groups

I know you tagged python, but have have prepared an online benchmark for javascript to show how capturing and non-capturing groups impacts in the js regex engine.

https://jsperf.com/capturing-groups-vs-non-capturing-groups

enter image description here

like image 65
Federico Piazza Avatar answered Sep 21 '22 09:09

Federico Piazza


Your patterns only differ in the capturing groups. When you define a capturing group in the regex pattern and use the pattern with re.search, the result will be a MatchObject instance. Each match object will contain as many groups as there are capturing groups in the pattern, even if they are empty. That is the overhead for the re internals: adding the (list of) groups (memory allocation, etc.). Mind that groups also contain such details as the starting and ending index of the text that they match and more (refer to the MatchObject reference).

like image 40
Wiktor Stribiżew Avatar answered Sep 21 '22 09:09

Wiktor Stribiżew