
Is there any way to shorten this regular expression?

Tags:

regex

ruby

I want to match strings in the format of A0123456, E0123456, or IN:A0123456Q, etc. I originally made this regex

^(IN:)?[AE][0-9]{7}Q?$

but it was also matching IN:E0123456, without the Q at the end. So I created this regex

(^IN:[AE][0-9]{7}Q$)|(^[AE][0-9]{7}$)

Is there any way to shorten this regex so that it requires both IN: and Q when either one is present, but still accepts the bare form when neither is present?
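To make the difference between the two patterns concrete, here is a small Ruby check. It is only a sketch: the sample strings and the variable names (first_attempt, second_attempt) are illustrative and not part of my application.

```ruby
# Compare the two patterns from the question on a few sample inputs.
first_attempt  = /^(IN:)?[AE][0-9]{7}Q?$/
second_attempt = /(^IN:[AE][0-9]{7}Q$)|(^[AE][0-9]{7}$)/

inputs = ["A0123456", "E0123456", "IN:A0123456Q", "IN:E0123456", "A0123456Q"]

# Each row is [input, matched by first pattern?, matched by second pattern?]
rows = inputs.map do |s|
  [s, !!(s =~ first_attempt), !!(s =~ second_attempt)]
end

rows.each do |s, first, second|
  puts "#{s}: first=#{first} second=#{second}"
end
```

The first pattern treats IN: and Q as independent options, so it accepts the half-decorated forms IN:E0123456 and A0123456Q; the second pattern rejects both.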

Edit: The regex will be used in Ruby.

Edit 2: I changed the regex to reflect that I was matching the wrong strings, as it would still match IN:A0123456.

Edit 3: Both answers below are valid. Since I am using Ruby 2.0, and would prefer a regex that does not depend on Ruby's flavor of subexpression calls in case I change my application later, I chose to accept matt's answer.

josh asked Sep 11 '13 22:09



2 Answers

The second regex has a problem:

^(IN:[AE][0-9]{7}Q)|([AE][0-9]{7})$

The | has lower precedence than concatenation, so the regex will be parsed as:

^(IN:[AE][0-9]{7}Q)        # Starts with (IN:[AE][0-9]{7}Q)
|                          # OR
([AE][0-9]{7})$            # Ends with ([AE][0-9]{7})

To fix this problem, just use a non-capturing group:

^(?:(IN:[AE][0-9]{7}Q)|([AE][0-9]{7}))$

This makes sure the whole input string matches one of the two formats, rather than merely starting with one format or ending with the other (which is clearly incorrect).
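As a rough illustration of the precedence issue in Ruby (the sample strings and the names unfixed and fixed are made up for this sketch):

```ruby
# Alternation binds more loosely than concatenation, so in the unfixed
# pattern ^ applies only to the first branch and $ only to the second.
unfixed = /^(IN:[AE][0-9]{7}Q)|([AE][0-9]{7})$/
fixed   = /^(?:(IN:[AE][0-9]{7}Q)|([AE][0-9]{7}))$/

samples = ["A0123456", "IN:A0123456Q", "IN:A0123456",
           "IN:A0123456Qjunk", "junkE0123456"]

# Each row is [input, matched by unfixed?, matched by fixed?]
results = samples.map do |s|
  [s, !!(s =~ unfixed), !!(s =~ fixed)]
end

results.each do |s, loose, strict|
  puts format("%-18s unfixed=%-5s fixed=%s", s, loose, strict)
end
```

The unfixed pattern accepts all five samples, because each of the last three either starts with the first format or ends with the second. The fixed pattern accepts only the first two.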


Regarding shortening the regex, you may replace [0-9] with \d if you want to, but it is fine as it is.

I don't think there is any other way to shorten the regex within the default level of support of Ruby.

Subroutine call

Just for your information, in Perl/PCRE, you can shorten it with a subroutine call:

^(?:([AE][0-9]{7})|(IN:(?1)Q))$

(?1) refers to the pattern defined by the first capturing group, i.e. [AE][0-9]{7}. The regex is effectively the same; it just looks shorter. This demo with input IN:E0123463Q shows the whole text being captured by group 2 (and no text captured for group 1).


In Ruby, a similar concept, the subexpression call, exists with slightly different syntax. Ruby uses \g<name> or \g<number> to refer to the capturing group whose pattern we want to reuse:

^(?:([AE][0-9]{7})|(IN:\g<1>Q))$

The test case here on rubular under Ruby 1.9, for input IN:E0123463Q, returns E0123463 as match for group 1 and IN:E0123463Q as match for group 2.

Ruby's (1.9) implementation seems to record the captured text for group 1 even when group 1 is not directly involved in the matching. In PCRE, subroutine calls do not capture text.
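A minimal sketch for trying the subexpression call locally instead of on rubular (the variable names are illustrative, and what group 1 holds may depend on your Ruby version, so inspect it rather than rely on it):

```ruby
# Group 1 defines the ID pattern; \g<1> re-runs that pattern inside group 2.
pattern = /^(?:([AE][0-9]{7})|(IN:\g<1>Q))$/

m = pattern.match("IN:E0123463Q")
puts "whole match: #{m[0]}"
puts "group 1:     #{m[1].inspect}"   # inspect this on your own Ruby version
puts "group 2:     #{m[2].inspect}"

# The half-decorated forms are still rejected.
no_q   = pattern.match("IN:E0123463")
only_q = pattern.match("E0123463Q")
bare   = pattern.match("E0123463")
puts "IN:E0123463 -> #{!no_q.nil?}"
puts "E0123463Q   -> #{!only_q.nil?}"
puts "E0123463    -> #{!bare.nil?}"
```

Note that numbered calls like \g<1> only work while the pattern has no named groups; if you mix the two, Ruby asks you to use names throughout.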

Conditional regex

There are also conditional patterns, which let you check whether a certain capturing group has matched something or not. You can check matt's answer for more information.

nhahtdh answered Nov 17 '22 10:11



If you are using Ruby 2.0, you can use an if-then-else conditional match (undocumented in the Ruby docs, but it does exist):

/^(IN:)?[AE][0-9]{7}(?(1)Q|)$/

The conditional part is (?(1)Q|) which says if group number 1 matched, then match Q, else match nothing. Since group number 1 is (IN:), this achieves what you want.
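A quick way to convince yourself, assuming Ruby 2.0 or later (the sample strings and the checks table are just for illustration):

```ruby
# (?(1)Q|) requires Q only when group 1, the optional IN: prefix, took part.
pattern = /^(IN:)?[AE][0-9]{7}(?(1)Q|)$/

samples = ["A0123456", "E0123456", "IN:A0123456Q", "IN:A0123456", "A0123456Q"]

# Each entry is [input, matched?]
checks = samples.map { |s| [s, !pattern.match(s).nil?] }

checks.each do |s, matched|
  puts "#{s}: #{matched ? 'match' : 'no match'}"
end
```

The first three samples match. IN:A0123456 fails because the conditional demands a Q once IN: has matched, and A0123456Q fails because the else branch matches nothing, leaving the Q in front of the end anchor.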

matt answered Nov 17 '22 10:11
