
Is there a compelling reason to use quantifiers in Perl regular expressions instead of just repeating the character?

Tags: regex, perl

I was performing a code review for a colleague and he had a regular expression that looked like this:

if ($value =~ /^\d\d\d\d$/) {
    #do stuff
}

I told him he should change it to:

if ($value =~ /^\d{4}$/) {
    #do stuff
}

To which he replied that he preferred the first for readability (I find the second more readable, but that's a religious debate I'll save for another day).

My question: is there an actual benefit to one over the other?

asked Mar 30 '10 18:03 by Morinar



2 Answers

There's no such thing as absolute readability. There's what people can individually recognize, which is why people often understand their code while nobody else can. If he never uses quantifiers, he's always going to think quantifiers are hard to read because he never learns to grok them.

I most often find that people say "more readable" when they really mean "that's what I know already" or "that's what I wrote the first time". That's not necessarily the case here, though.

An absolute quantifier like {4} is just easier to specify and communicate to other programmers. Who wants to count the number of \ds by hand? You write code for other people to read, so don't make their life harder.

However, you might have missed the bug in that code because you were focused on the quantifier issue. The $ anchor allows a trailing newline at the end of the string, and if a Perl Best Practices zealot comes along and blindly adds /xsm to all regexes (a painful experience I've seen more than a few times), that $ matches before any newline and lets even more invalid input through. You probably want the \z absolute end-of-string anchor instead.
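The anchor difference is easy to check with a small sketch. The helper names and sample strings below are hypothetical, chosen only for illustration; they are not from the code under review.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration; each returns 1 on a match, 0 otherwise.
sub matches_dollar { my ($value) = @_; return $value =~ /^\d{4}$/    ? 1 : 0 }
sub matches_z      { my ($value) = @_; return $value =~ /^\d{4}\z/   ? 1 : 0 }
sub matches_xsm    { my ($value) = @_; return $value =~ /^\d{4}$/xsm ? 1 : 0 }

for my $value ("1234", "1234\n", "1234\nDROP TABLE users") {
    # Show newlines as a visible \n so each sample prints on one line.
    (my $shown = $value) =~ s/\n/\\n/g;
    printf "%-28s \$: %d  \\z: %d  \$ with /xsm: %d\n",
        qq{"$shown"}, matches_dollar($value), matches_z($value), matches_xsm($value);
}
```

With these samples, plain $ accepts "1234\n", the /xsm version also accepts the multi-line string, and only \z rejects everything except the bare four digits.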

Not that it happened in your case, but code reviews tend to turn into style or syntax reviews (because those are easier to notice) and actually miss the point of checking for proper and intended behavior and correct design. Often the style problems aren't worth worrying about considering all of the other ways you could spend time to improve code. :)

answered Sep 28 '22 08:09 by brian d foy


They do exactly the same thing, so as far as practicality goes it's a matter of preference. Is there a tiny performance difference one way or the other? Who knows, but it's surely insignificant.
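If you want to check both claims yourself, here is a rough sketch using the core Benchmark module. The sample strings and the helper name are assumptions made for illustration, and the timings will vary by Perl version and machine, so no results are shown.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $repeated   = qr/^\d\d\d\d$/;
my $quantified = qr/^\d{4}$/;

# Hypothetical helper: true when both patterns give the same verdict for a string.
sub same_verdict {
    my ($value) = @_;
    my $r = $value =~ $repeated   ? 1 : 0;
    my $q = $value =~ $quantified ? 1 : 0;
    return $r == $q;
}

# Illustrative inputs only; extend with whatever your data looks like.
my @samples = ( '1234', '123', '12345', 'abcd', "1234\n", '' );
for my $value (@samples) {
    die "patterns disagree on a sample" unless same_verdict($value);
}

# Run each form for at least one CPU second and print a comparison table.
cmpthese( -1, {
    repeated   => sub { for my $v (@samples) { my $ok = $v =~ $repeated } },
    quantified => sub { for my $v (@samples) { my $ok = $v =~ $quantified } },
} );
```

A handful of samples is not a proof of equivalence, and a micro-benchmark like this says little about a real program, but it is enough to see whether any difference is worth caring about.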

The quantifiers are more useful (and required) when the pattern length isn't fixed, for example \d{12,16}, \d{2,}, etc.
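A short sketch of those two range forms, for illustration. The sub names are hypothetical, and the \A and \z anchors are an added assumption so that the whole string must match.

```perl
use strict;
use warnings;

# Hypothetical validators; \A and \z anchor the match to the entire string.
sub is_12_to_16_digits { my ($s) = @_; return $s =~ /\A\d{12,16}\z/ ? 1 : 0 }
sub has_two_or_more    { my ($s) = @_; return $s =~ /\A\d{2,}\z/    ? 1 : 0 }

# {12,16} sets both a lower and an upper bound on the repeat count.
for my $length ( 11, 12, 16, 17 ) {
    printf "%2d digits, {12,16}: %d\n", $length, is_12_to_16_digits( '1' x $length );
}

# {2,} sets only a lower bound.
for my $value ( '7', '42', '123456789' ) {
    printf "%-10s {2,}: %d\n", $value, has_two_or_more($value);
}
```

Neither pattern can be written by repeating \d a fixed number of times; the closest equivalent would be a long alternation or a chain of optional digits.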

I prefer \d{4}, which is easier for my brain to parse than \d\d\d\d.

Also, what if you're matching a character class rather than a simple digit? [aeiouy0-9]{4} or [aeiouy0-9][aeiouy0-9][aeiouy0-9][aeiouy0-9]?

answered Sep 28 '22 07:09 by Rob