
Using explicitly numbered repetition instead of question mark, star and plus

I've seen regex patterns that use explicitly numbered repetition instead of ?, * and +, i.e.:

Explicit            Shorthand
(something){0,1}    (something)?
(something){1}      (something)
(something){0,}     (something)*
(something){1,}     (something)+

The questions are:

  • Are these two forms identical? What if you add possessive/reluctant modifiers?
  • If they are identical, which one is more idiomatic? More readable? Simply "better"?
polygenelubricants asked Jun 13 '10 14:06



2 Answers

To my knowledge they are identical. I think there may be a few engines out there that don't support the numbered syntax, but I'm not sure which. I vaguely recall a question on SO a few days ago where the explicit notation wouldn't work in Notepad++.
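A quick way to check the claim for one particular engine is to compare both spellings on a handful of inputs. The sketch below does this for Python's `re` module only. The sample strings and the `same_matches` helper are made up for illustration, and the result says nothing about other engines. Possessive forms are left out because support for them varies between engines and versions.

```python
import re

# (explicit, shorthand) quantifier pairs taken from the question's table.
PAIRS = [
    ("{0,1}", "?"),
    ("{0,}", "*"),
    ("{1,}", "+"),
]

# Hypothetical sample inputs, chosen only for illustration.
SAMPLES = ["", "a", "aa", "aaab", "baaa"]


def same_matches(explicit: str, shorthand: str, suffix: str = "") -> bool:
    """Return True if both spellings find the same match spans in every sample."""
    p1 = re.compile("(a)" + explicit + suffix)
    p2 = re.compile("(a)" + shorthand + suffix)
    return all(
        [m.span() for m in p1.finditer(s)] == [m.span() for m in p2.finditer(s)]
        for s in SAMPLES
    )


# Greedy forms first, then reluctant (lazy) forms with a trailing "?".
greedy_ok = all(same_matches(e, s) for e, s in PAIRS)
lazy_ok = all(same_matches(e, s, "?") for e, s in PAIRS)
print(greedy_ok, lazy_ok)
```

`{1}` is deliberately not in the list of pairs: appending `?` to `(a){1}` makes the quantifier reluctant, while appending `?` to a bare `(a)` makes the group optional, so those two are no longer the same pattern.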

The only time I would use explicitly numbered repetition is when the repetition is greater than 1:

  • Exactly two: {2}
  • Two or more: {2,}
  • Two to four: {2,4}

I tend to prefer these, especially when the repeated pattern is more than a few characters long. If you have to match three digits, some people like to write \d\d\d, but I would rather write \d{3} since it emphasizes the number of repetitions involved. Furthermore, if that number ever needs to change down the road, I only need to change {3} to {n} rather than re-parse the regex in my head or worry about messing it up; it requires less mental effort.
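To make the maintenance argument concrete, here is a minimal Python sketch. The `digits_pattern` helper is hypothetical and exists only to show that the counted form turns the repetition count into a single value that can be changed in one place.

```python
import re


def digits_pattern(n: int) -> re.Pattern:
    """Build a pattern that matches exactly n digits (hypothetical helper)."""
    # The doubled braces are literal braces in an f-string, so n=3 gives \d{3}.
    return re.compile(rf"\d{{{n}}}")


three_spelled_out = re.compile(r"\d\d\d")
three_counted = digits_pattern(3)

# Both spellings accept exactly three digits.
both_accept = bool(three_spelled_out.fullmatch("123")) and bool(three_counted.fullmatch("123"))
# Two digits are too few for the counted form.
rejects_short = three_counted.fullmatch("12") is None
# Changing the count means changing one number, not retyping \d.
five_ok = bool(digits_pattern(5).fullmatch("12345"))
```

With the spelled-out form, going from three to five digits means counting `\d` tokens by eye; with the counted form only the number changes.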

If that criterion isn't met, I prefer the shorthand. Using the "explicit" notation quickly clutters up the pattern and makes it hard to read. I've worked on a project where some developers didn't know regex too well (it's not exactly everyone's favorite topic), and I saw a lot of {1} and {0,1} occurrences. A few people would ask me to review their patterns, and that's when I would suggest changing those occurrences to the shorthand notation to save space and, IMO, improve readability.

Ahmad Mageed answered Sep 28 '22 03:09



I can see how, if you have a regex that does a lot of bounded repetition, you might want to use the {n,m} form consistently for readability's sake. For example:

/^  abc{2,5}  xyz{0,1}  foo{3,12}  bar{1,}  $/x 

But I can't recall ever seeing such a case in real life. When I see {0,1}, {0,} or {1,} being used in a question, it's virtually always being done out of ignorance. And in the process of answering such a question, we should also suggest that they use the ?, * or + instead.
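For anyone who wants to try the free-spacing example, here is a rough Python equivalent. It assumes `re.VERBOSE` as the counterpart of the `/x` flag, and the sample strings are invented. Note that each quantifier binds only to the character right before it, so `abc{2,5}` means "ab" followed by two to five "c" characters.

```python
import re

# Consistent {n,m} style, as in the example above.
explicit = re.compile(r"^ abc{2,5} xyz{0,1} foo{3,12} bar{1,} $", re.VERBOSE)
# Same pattern with ? and + where they apply.
mixed = re.compile(r"^ abc{2,5} xyz? foo{3,12} bar+ $", re.VERBOSE)

# Hypothetical inputs: the first two should match, the last two should not
# (one "c" is too few; "foo" has only one "o" after "fo").
samples = ["abccxyfoooobar", "abccxyzfoooobarrr", "abcxyfoooobar", "abccxyfoobar"]
results = [(bool(explicit.match(s)), bool(mixed.match(s))) for s in samples]
print(results)
```

Both spellings give the same answer on every sample here, so the choice between them is purely about how the pattern reads.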

And of course, {1} is pure clutter. Some people seem to have a vague notion that it means "one and only one"--after all, it must mean something, right? Why would such a pathologically terse language support a construct that takes up a whole three characters and does nothing at all? Its only legitimate use that I know of is to isolate a backreference that's followed by a literal digit (e.g. \1{1}0), but there are other ways to do that.
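The backreference case can be sketched in Python's `re` module. How `\10` is parsed differs between engines, so treat this as an illustration for Python only. The two alternatives shown (a non-capturing group and a named group) are just examples of the "other ways", not necessarily the ones the answer had in mind.

```python
import re

# "\1{1}0": backreference to group 1, then a literal "0".
isolated = re.compile(r"(a)\1{1}0")
isolated_ok = bool(isolated.fullmatch("aa0"))

# Without a separator, Python's re reads "\10" as a reference to group 10,
# which does not exist in this pattern, so compilation fails.
try:
    re.compile(r"(a)\10")
    ambiguous_compiles = True
except re.error:
    ambiguous_compiles = False

# Alternative 1: wrap the backreference in a non-capturing group.
grouped = re.compile(r"(a)(?:\1)0")
# Alternative 2: use a named group, which cannot run into a following digit.
named = re.compile(r"(?P<ch>a)(?P=ch)0")

alternatives_ok = bool(grouped.fullmatch("aa0")) and bool(named.fullmatch("aa0"))
```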

Alan Moore answered Sep 28 '22 03:09
