
Can I optimize this phone-regex?

Tags:

regex

Ok, so I have this regex:

( |^|>)(((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{2})(-)?( )?)?)([0-9]{7}))|((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{3})(-)?( )?)?)([0-9]{6}))|((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{1})(-)?( )?)?)([0-9]{8})))( |$|<)

It matches Dutch and Belgian phone numbers (I only want those, hence 31 and 32 as the country codes).

It's not much fun to decipher and, as you can see, it contains a lot of duplication, but it does handle these numbers very accurately.

All of the following European-formatted phone numbers are accepted:

0031201234567
0031223234567
0031612345678
+31(0)20-1234567
+31(0)223-234567
+31(0)6-12345678
020-1234567
0223-234567
06-12345678
0201234567
0223234567
0612345678

and the following incorrectly formatted ones are rejected:

06-1234567 (a mobile phone number in the Netherlands must have 8 digits after 06)
0223-1234567 (a 4-digit area code followed by a 7-digit subscriber number)

as opposed to this one, which is valid:

020-1234567 (a 3-digit area code takes a 7-digit subscriber number, whereas a 4-digit area code can only take a 6-digit one)

As you can see it's the '-' character that makes it a little difficult but I need it in there because it's a part of the formatting usually used by people, and I want to be able to parse them all.

Now my question is: do you see a way to simplify this regex (or even improve it, if you see a fault in it) while keeping the same rules?

You can test it at regextester.com
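
The accepted and rejected samples above can also be checked locally. What follows is a minimal Python sketch rather than part of the original question: the pattern is copied verbatim, while the names PHONE, ACCEPTED, REJECTED and is_phone are illustrative only.

```python
import re

# The one-line pattern from the question, copied verbatim
# (split over adjacent string literals only for line length).
PHONE = re.compile(
    r"( |^|>)(((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{2})(-)?( )?)?)([0-9]{7}))"
    r"|((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{3})(-)?( )?)?)([0-9]{6}))"
    r"|((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{1})(-)?( )?)?)([0-9]{8})))( |$|<)"
)

# Samples listed in the question.
ACCEPTED = [
    "0031201234567", "0031223234567", "0031612345678",
    "+31(0)20-1234567", "+31(0)223-234567", "+31(0)6-12345678",
    "020-1234567", "0223-234567", "06-12345678",
    "0201234567", "0223234567", "0612345678",
]
REJECTED = ["06-1234567", "0223-1234567"]


def is_phone(text: str) -> bool:
    """True if the pattern finds a phone number anywhere in text."""
    return PHONE.search(text) is not None


for sample in ACCEPTED + REJECTED:
    print(f"{sample:20} {'match' if is_phone(sample) else 'no match'}")
```

Keeping the samples in two lists makes it easy to re-run the same check against any simplified version of the pattern.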

(The '( |^|>)' checks that the number is at the start of a word: it must be preceded by a space, the start of the line, or a '>'. I search for the phone numbers in HTML pages.)
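
One consequence of matching the delimiters as ordinary groups is that they are consumed as part of the match, so two numbers separated by a single space cannot both be found in one scan. The sketch below illustrates this with a deliberately simplified stand-in (ten digits) instead of the full pattern. The lookaround variant is only one possible alternative and is an assumption here, not something the question uses.

```python
import re

# Simplified stand-in for the phone pattern: ten digits, nothing else.
NUMBER = r"[0-9]{10}"

# Delimiters matched as ordinary groups, as in the question: they are consumed.
consuming = re.compile(r"( |^|>)(" + NUMBER + r")( |$|<)")

# Hypothetical alternative: assert the delimiters without consuming them.
lookaround = re.compile(r"(?:(?<=[ >])|^)(" + NUMBER + r")(?=[ <]|$)")

text = "<td>0201234567 0612345678</td>"

# The first match swallows the separating space, so the second number
# no longer has a delimiter in front of it and is skipped.
found_consuming = [m.group(2) for m in consuming.finditer(text)]

# Lookarounds leave the space in place, so both numbers are found.
found_lookaround = lookaround.findall(text)

print(found_consuming)
print(found_lookaround)
```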

asked Nov 06 '08 13:11 by youri


1 Answer

First observation: reading the regex is a nightmare. It cries out for Perl's /x mode.

Second observation: there are lots, and lots, and lots of capturing parentheses in the expression (42 if I count correctly; and 42 is, of course, "The Answer to Life, the Universe, and Everything" -- see Douglas Adams's "The Hitchhiker's Guide to the Galaxy" if you need that explained).
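
The count can be checked mechanically. This is a small Python sketch under the assumption that Python's `re` numbers capturing groups the same way for this pattern. The helper name `alternative` is made up for illustration. It rebuilds the question's pattern from its three near-identical branches, which also shows that they differ only in two digit counts.

```python
import re


def alternative(area_digits: int, subscriber_digits: int) -> str:
    """One branch of the question's pattern; only the two digit counts vary."""
    return (
        r"((((((\+|00)(31|32)( )?(\(0\))?)|0)"
        rf"([0-9]{{{area_digits}}})(-)?( )?)?)"
        rf"([0-9]{{{subscriber_digits}}}))"
    )


# Reassemble the full one-liner: delimiter, three branches, delimiter.
branches = [alternative(a, s) for a, s in [(2, 7), (3, 6), (1, 8)]]
pattern = "( |^|>)(" + "|".join(branches) + ")( |$|<)"

# Pattern.groups is the number of capturing groups in a compiled pattern.
per_branch = re.compile(branches[0]).groups
total = re.compile(pattern).groups

print(per_branch, total)
```

Each branch contributes 13 groups, and the outer group plus the two delimiter groups bring the total to 42.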

Bill the Lizard notes that you use '(-)?( )?' several times. There's no obvious advantage to that compared with '-? ?' or possibly '[- ]?', unless you are really intent on capturing the actual punctuation separately (but there are so many capturing parentheses that working out which '$n' items to use would be hard).
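
The three spellings are not quite interchangeable, which may be why '[- ]?' is only offered as a possibility. A quick Python sketch (the names VARIANTS, SEPARATORS and accepted are illustrative) lists which separators each one matches in full:

```python
import re

VARIANTS = ["(-)?( )?", "-? ?", "[- ]?"]
SEPARATORS = ["", "-", " ", "- ", " -"]


def accepted(pattern):
    """Return the separators that the pattern matches in full."""
    return [s for s in SEPARATORS if re.fullmatch(pattern, s)]


for pattern in VARIANTS:
    print(repr(pattern), accepted(pattern))
```

'(-)?( )?' and '-? ?' accept the same separators, including a dash followed by a space, while '[- ]?' accepts at most one character and so rejects "- ".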

So, let's try editing a copy of your one-liner:

( |^|>)
(
    ((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{2})(-)?( )?)?)([0-9]{7})) |
    ((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{3})(-)?( )?)?)([0-9]{6})) |
    ((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{1})(-)?( )?)?)([0-9]{8}))
)
( |$|<)

OK - now we can see the regular structure of your regular expression.
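
One caveat if this layout is fed to a regex engine as-is: in an extended mode such as Perl's /x or Python's `re.VERBOSE`, unescaped whitespace outside a character class is ignored, so the literal spaces in '( |^|>)' and '( )?' would silently disappear. The sketch below uses Python's `re.VERBOSE` as a stand-in for /x and writes each literal space as `[ ]`. That spelling is a choice made for this sketch, not something taken from the answer.

```python
import re

# Whitespace outside a character class is dropped in verbose mode.
space_ignored = re.fullmatch(r"a b", "ab", re.VERBOSE) is not None
space_kept = re.fullmatch(r"a[ ]b", "a b", re.VERBOSE) is not None
print(space_ignored, space_kept)

# The laid-out pattern, with every literal space spelled [ ].
PHONE_X = re.compile(r"""
    ([ ]|^|>)
    (
        ((((((\+|00)(31|32)([ ])?(\(0\))?)|0)([0-9]{2})(-)?([ ])?)?)([0-9]{7})) |
        ((((((\+|00)(31|32)([ ])?(\(0\))?)|0)([0-9]{3})(-)?([ ])?)?)([0-9]{6})) |
        ((((((\+|00)(31|32)([ ])?(\(0\))?)|0)([0-9]{1})(-)?([ ])?)?)([0-9]{8}))
    )
    ([ ]|$|<)
""", re.VERBOSE)

for sample in ["+31(0)20-1234567", "+31 (0)20- 1234567", "0223-234567", "06-1234567"]:
    print(f"{sample:20} {'match' if PHONE_X.search(sample) else 'no match'}")
```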

There's much more analysis possible from here. Yes, there can be vast improvements to the regular expression. The first, obvious, one is to extract the international prefix part, and apply that once (optionally, or require the leading zero) and then apply the national rules.

( |^|>)
(
    (((\+|00)(31|32)( )?(\(0\))?)|0)
    (?:
        (((([0-9]{2})(-)?( )?)?)([0-9]{7})) |
        (((([0-9]{3})(-)?( )?)?)([0-9]{6})) |
        (((([0-9]{1})(-)?( )?)?)([0-9]{8}))
    )
)
( |$|<)

Then we can simplify the punctuation as noted before, and remove some plausibly redundant parentheses, and improve the country code recognizer:

( |^|>)
(
    (((\+|00)3[12] ?(\(0\))?)|0)
    (?:
        (((([0-9]{2})-? ?)?)[0-9]{7}) |
        (((([0-9]{3})-? ?)?)[0-9]{6}) |
        (((([0-9]{1})-? ?)?)[0-9]{8})
    )
)
( |$|<)

We can observe that the regex does not enforce the rules on mobile phone codes (so it does not insist that '06' is followed by 8 digits, for example). It also seems to allow the 1, 2 or 3 digit 'exchange' code to be optional, even with an international prefix - probably not what you had in mind, and fixing that removes some more parentheses. We can remove still more parentheses after that. Since the '#' comments require /x mode, where unescaped whitespace is ignored, each literal space is written as '[ ]', and the three national alternatives are grouped so that the prefix applies to all of them, leading to:

([ ]|^|>)
(
    (((\+|00)3[12][ ]?(\(0\))?)|0)      # International prefix or leading zero
    (?:
        ([0-9]{2}-?[ ]?[0-9]{7}) |      # xx-xxxxxxx
        ([0-9]{3}-?[ ]?[0-9]{6}) |      # xxx-xxxxxx
        ([0-9]{1}-?[ ]?[0-9]{8})        # x-xxxxxxxx
    )
)
([ ]|$|<)

And you can work out further optimizations from here, I'd hope.
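
As a starting point for that, here is a hedged Python sketch that runs a version of the final pattern against the samples from the question. It rests on two assumptions not spelled out above: literal spaces are written as `[ ]` so they survive verbose mode, and the three national alternatives sit inside one `(?: ... )` group. The group matters because `|` binds more loosely than concatenation, so without it the prefix would attach to the first alternative only. The names PHONE, ACCEPTED, REJECTED and is_phone are illustrative.

```python
import re

PHONE = re.compile(r"""
    ([ ]|^|>)
    (
        (((\+|00)3[12][ ]?(\(0\))?)|0)      # international prefix or leading zero
        (?:
            ([0-9]{2}-?[ ]?[0-9]{7}) |      # xx-xxxxxxx
            ([0-9]{3}-?[ ]?[0-9]{6}) |      # xxx-xxxxxx
            ([0-9]{1}-?[ ]?[0-9]{8})        # x-xxxxxxxx
        )
    )
    ([ ]|$|<)
""", re.VERBOSE)

# Samples listed in the question.
ACCEPTED = [
    "0031201234567", "0031223234567", "0031612345678",
    "+31(0)20-1234567", "+31(0)223-234567", "+31(0)6-12345678",
    "020-1234567", "0223-234567", "06-12345678",
    "0201234567", "0223234567", "0612345678",
]
REJECTED = ["06-1234567", "0223-1234567"]


def is_phone(text: str) -> bool:
    """True if the pattern finds a phone number anywhere in text."""
    return PHONE.search(text) is not None


for sample in ACCEPTED + REJECTED:
    print(f"{sample:20} {'match' if is_phone(sample) else 'no match'}")

# Group 2 holds the number itself, without the surrounding delimiters.
print(PHONE.search("<b>06-12345678</b>").group(2))
```

Unlike the original, this version requires either an international prefix or a leading zero, so a bare subscriber number such as "1234567" no longer matches.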

answered Nov 01 '22 21:11 by Jonathan Leffler