Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is regex a costly operation? It seems atleast

I was writing a regex pattern for a string. The string being a constant type/structure. What I mean is, it looks like(this format is not so important, have a look at the example next)-

[Data Code Format]: POP N/N/N (N: object): JSON object data

Here N represents a number or digit. and what's inside [ ] is a set of string block. But, this format is constant.

So, I wrote a regex-

\s*((?:\S+\s+){1}\S+)\s*(?:\((?:.*?)\))?:\s*(\S*)\s*(\w+)

Keeping this string example in mind-

%DATA-4-JSON: POP 0/1/0 (1: object): JSON object data

It works perfectly, but, what I see on regex101.com is that there is a successful match. But, it has undergone 330 steps to achieve this.

Screenshot-

image

My question is, its taking 330 steps to achieve this(atleast in my mind I feel its pretty heavy), which I guess can be achieved using if-else and other comparisons with lesser steps right?

I mean, is regex string parsing so heavy? 330 steps for like 10000's of strings I need to parse is going to be heavy right?

like image 964
bozzmob Avatar asked Jan 08 '16 10:01

bozzmob


1 Answers

When you are using regexps, they can be costly if you use backtracking. When you use quantifiers with consequent patterns that may match one another (and in case the patterns between them may match empty strings (usually, declared with *, {0,x} or ? quantifiers)), backtracking might play a bad trick on you.

What is bad about your regex?

A slight performance issue is caused by an unnecessary non-capturing group (?:\S+\s+){1}. It can be written as \S+\s+ and it will already decrease the number of steps (a bit, 302 steps). However, the worst part is that \S matches the : delimiter. Thus, the regex engine has to try a lot of possible routes to match the beginning of your expected match. Replace that \S+\s+\S with [^:\s]+\s+[^:\s]+, and the step amount will decrease to 159!

Now, coming to (?:\((?:.*?)\))? - removing the unnecessary inner non-capturing group gives another slight improvement (148 steps), and replacing .*? with a negated character class gives another boost (139 steps).

I'd use this for now:

\s*([^:\s]+\s+[^:\s]+)\s*(?:\([^()]*\))?:\s*(\S*)\s*(\w+)

If the initial \s* was obligatory, it would be another enhancement, but you would have different matches.

like image 172
Wiktor Stribiżew Avatar answered Sep 28 '22 00:09

Wiktor Stribiżew