
Why is a character class faster than alternation?

It seems that using a character class is faster than the alternation in an example like:
[abc] vs (a|b|c)
I have heard this recommended, and a simple test using Time::HiRes confirmed it (the alternation was ~10 times slower).
Using (?:a|b|c), in case the capturing parentheses make a difference, does not change the result.
But I cannot understand why. I think it is because of backtracking, but the way I see it there are three character comparisons at each position either way, so I am not sure how backtracking comes into play for the alternation. Is it a consequence of how alternation is implemented?
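
For anyone who wants to repeat the comparison, here is a minimal sketch of such a timing test. It uses Python's timeit rather than the Perl Time::HiRes setup described above, and the input string and repeat count are made-up values, so treat the numbers as illustrative only. The ratio depends heavily on the engine: some engines compile an alternation of single characters into a character class, in which case little or no difference shows up.

```python
import re
import timeit

# Hypothetical input: no a, b or c until the very last character,
# so the engine has to test every position before it succeeds.
text = "xyz" * 2000 + "a"

patterns = {
    "class": re.compile(r"[abc]"),
    "capturing alternation": re.compile(r"(a|b|c)"),
    "non-capturing alternation": re.compile(r"(?:a|b|c)"),
}


def time_search(compiled, repeat=200):
    """Total seconds spent running compiled.search(text) `repeat` times."""
    return timeit.timeit(lambda: compiled.search(text), number=repeat)


timings = {name: time_search(compiled) for name, compiled in patterns.items()}
for name, seconds in timings.items():
    print(f"{name:28s}{seconds:.4f}s")
```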

Jim asked Mar 02 '14 19:03




2 Answers

This is because the "OR" construct | backtracks between the alternatives: if the first alternative does not match, the engine has to return to the position the pointer was at before that attempt and then try the next alternative, whereas a character class can simply advance sequentially. Compare these two matches on a regex engine with optimizations disabled:

Pattern: (r|f)at
Match string: carat

[Image: alternations]

Pattern: [rf]at
Match string: carat

[Image: class]


In short, the fact that the PCRE engine optimizes this away (an alternation of single literal characters becomes a character class) is already a decent hint that alternations are inefficient.
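
The two matches above can be roughly reproduced with a toy backtracking matcher. This is only a sketch to make the bookkeeping visible, not how Perl or PCRE are actually implemented. The token format, the function names and the two counters are invented for illustration, and literals are modelled as one-character classes.

```python
def match_here(tokens, text, pos, stats):
    """Return True if the token list matches text starting at pos."""
    if not tokens:
        return True
    (kind, value), rest = tokens[0], tokens[1:]
    if kind == "class":
        # One membership test; on failure there is nothing to come back to.
        stats["tests"] += 1
        if pos < len(text) and text[pos] in value:
            return match_here(rest, text, pos + 1, stats)
        return False
    if kind == "alt":
        for branch in value:
            stats["tests"] += 1
            if text.startswith(branch, pos) and match_here(
                rest, text, pos + len(branch), stats
            ):
                return True
            # This branch failed: rewind to pos and try the next one.
            stats["abandoned_branches"] += 1
        return False
    raise ValueError(f"unknown token kind: {kind}")


def search(tokens, text):
    """Try every start position; return (match start or -1, counters)."""
    stats = {"tests": 0, "abandoned_branches": 0}
    for start in range(len(text) + 1):
        if match_here(tokens, text, start, stats):
            return start, stats
    return -1, stats


ALTERNATION = [("alt", ["r", "f"]), ("class", "a"), ("class", "t")]  # (r|f)at
CHAR_CLASS = [("class", "rf"), ("class", "a"), ("class", "t")]       # [rf]at

alt_start, alt_stats = search(ALTERNATION, "carat")
cls_start, cls_stats = search(CHAR_CLASS, "carat")
print(alt_start, alt_stats)  # 2 {'tests': 7, 'abandoned_branches': 4}
print(cls_start, cls_stats)  # 2 {'tests': 5, 'abandoned_branches': 0}
```

Both patterns find the match at the same offset, but the alternation has to give up on a branch and return to the same position four times before it gets there, while the class never has to rewind.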

Unihedron answered Oct 17 '22 08:10


Because a character class like [abc] is irreducible and can be optimised, whereas an alternation like (?:a|b|c) may just as well be (?:aa(?!xx)|[^xba]*?|t(?=.[^t])t).

The authors of the regex engine have chosen not to make the compiler check whether every element of an alternation is a single character, which is what it would take to optimise that special case.

There is a big difference between "check that the next character is in this character class" and "check that the rest of the string matches any one of these regular expressions".
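
To make that difference concrete, here is a small sketch using Python's re module. The patterns and subject strings are arbitrary examples chosen for illustration. A character class always consumes exactly one character, while each branch of an alternation is a full sub-pattern that may consume any number of characters.

```python
import re

# Each branch of the alternation is its own regular expression.
alternation = re.compile(r"(?:ab|a|[0-9]+)")
# The class can only ever answer "is the next character one of these?".
char_class = re.compile(r"[ab0-9]")

results = {}
for subject in ("abc", "2014x"):
    alt_match = alternation.match(subject).group()
    cls_match = char_class.match(subject).group()
    results[subject] = (alt_match, cls_match)
    print(f"{subject!r}: alternation matched {alt_match!r}, class matched {cls_match!r}")

# 'abc': alternation matched 'ab', class matched 'a'
# '2014x': alternation matched '2014', class matched '2'
```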

Borodin answered Oct 17 '22 08:10