
Anyone see anything wrong with my regex for port numbers?

Tags:

regex

I made a regex for port numbers (before you say this is a bad idea: it's going into a bigger regex for URLs, which is much harder than it sounds).

My coworker said this is really bad and isn't going to catch everything. I disagree.

I believe this thing catches everything from 0 to 65535 and nothing else, and I'm looking for confirmation of this.

Single-line version (for computers):

/(^[0-9]$)|(^[0-9][0-9]$)|(^[0-9][0-9][0-9]$)|(^[0-9][0-9][0-9][0-9]$)|((^[0-5][0-9][0-9][0-9][0-9]$)|(^6[0-4][0-9][0-9][0-9]$)|(^65[0-4][0-9][0-9]$)|(^655[0-2][0-9]$)|(^6553[0-5]$))/

Human readable version:

/(^[0-9]$)|                           # single digit
 (^[0-9][0-9]$)|                      # two digit
 (^[0-9][0-9][0-9]$)|                 # three digit
 (^[0-9][0-9][0-9][0-9]$)|            # four digit
 ((^[0-5][0-9][0-9][0-9][0-9]$)|      # five digit (up to 59999)
  (^6[0-4][0-9][0-9][0-9]$)|          #            (up to 64999)
  (^65[0-4][0-9][0-9]$)|              #            (up to 65499)
  (^655[0-2][0-9]$)|                  #            (up to 65529)
  (^6553[0-5]$))/                     #            (up to 65535)

Can someone confirm that my understanding is correct (or otherwise)?
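
One way to test the claim empirically is brute force: run the pattern against every integer in a range wider than the valid ports. The sketch below assumes Python's re syntax (the question does not name a regex flavor) and copies the single-line pattern unchanged apart from dropping the /.../ delimiters. The names PORT_RE and accepted are made up for this illustration.

```python
import re

# The asker's single-line pattern, with only the /.../ delimiters removed.
PORT_RE = re.compile(
    r"(^[0-9]$)|(^[0-9][0-9]$)|(^[0-9][0-9][0-9]$)|(^[0-9][0-9][0-9][0-9]$)|"
    r"((^[0-5][0-9][0-9][0-9][0-9]$)|(^6[0-4][0-9][0-9][0-9]$)|"
    r"(^65[0-4][0-9][0-9]$)|(^655[0-2][0-9]$)|(^6553[0-5]$))"
)

def accepted(limit=100000):
    """Return the set of integers in [0, limit) whose decimal form matches."""
    return {i for i in range(limit) if PORT_RE.search(str(i))}

if __name__ == "__main__":
    # True if exactly the integers 0..65535 are accepted.
    print(accepted() == set(range(65536)))
    # Zero-padded strings of up to five digits are accepted as well.
    print(PORT_RE.search("0080") is not None)
```

Under these assumptions both checks print True, so the pattern accepts exactly 0 through 65535 for unpadded numbers, and additionally accepts zero-padded forms such as "0080".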

asdadas asked Sep 15 '10 06:09


2 Answers

You could shorten it considerably:

^0*(?:6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}|[1-5][0-9]{4}|[1-9][0-9]{1,3}|[0-9])$
  • no need to repeat the anchors every single time
  • no need for lots of capturing groups
  • no need to spell out repetitions.

Drop the leading 0* if you don't want to allow leading zeroes.
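
A rough way to see the effect of the leading 0* is to build both variants from the same core and compare them. This is only a sketch using Python's re module; the names CORE, WITH_ZEROES, NO_ZEROES and accepts are invented for the example.

```python
import re

# Core alternation from the shortened regex, without anchors or the 0* prefix.
CORE = (
    r"(?:6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}"
    r"|[1-5][0-9]{4}|[1-9][0-9]{1,3}|[0-9])"
)

WITH_ZEROES = re.compile(r"^0*" + CORE + r"$")  # leading zeroes allowed
NO_ZEROES = re.compile(r"^" + CORE + r"$")      # leading zeroes rejected

def accepts(regex, text):
    """Return True if the compiled regex matches the given string."""
    return regex.search(text) is not None

if __name__ == "__main__":
    print(accepts(WITH_ZEROES, "00080"))  # True
    print(accepts(NO_ZEROES, "00080"))    # False
    print(accepts(NO_ZEROES, "0"))        # True: a lone zero is still a valid port string
    # Exhaustive check over unpadded numbers: exactly 0..65535 should match.
    ok = {i for i in range(100000) if accepts(NO_ZEROES, str(i))}
    print(ok == set(range(65536)))        # True
```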

This regex is also better because it matches the special cases (65535, 65001 etc.) first and thus avoids some backtracking.

Oh, and since you said you want to use this as part of a larger regex for URLs, you should then replace both ^ and $ with \b (word boundary anchors).
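
To illustrate why the anchors have to go, here is a small sketch in Python. The surrounding pattern (a literal colon in front of the port) is a deliberately simplistic assumption and not a real URL grammar; it would be confused by things like user:password@host or IPv6 literals. ANCHORED, BOUNDED and extract_port are hypothetical names.

```python
import re

# Port sub-pattern from above, without ^ and $.
PORT = (
    r"0*(?:6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}"
    r"|[1-5][0-9]{4}|[1-9][0-9]{1,3}|[0-9])"
)

ANCHORED = re.compile(r"^" + PORT + r"$")       # only matches a bare port string
BOUNDED = re.compile(r":\b(" + PORT + r")\b")   # port after a colon, inside a longer string

def extract_port(url):
    """Return the port found after a colon as an int, or None if there is none."""
    match = BOUNDED.search(url)
    return int(match.group(1)) if match else None

if __name__ == "__main__":
    url = "http://example.com:8080/index.html"
    print(ANCHORED.search(url))                        # None: ^ and $ cannot match mid-string
    print(extract_port(url))                           # 8080
    print(extract_port("http://example.com:65536/"))   # None: out of range
```

The trailing \b is what stops the pattern from accepting a valid prefix such as 6553 out of 65536: there is no word boundary between two digits.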


Edit: @ceving asked if the repetition of 6553, 655, 65 and 6 is really necessary. The answer is no - you can also use a nested regex instead of having to repeat those leading digits. Let's just consider the section

6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}

This can be rewritten as

6(?:[0-4][0-9]{3}|5(?:[0-4][0-9]{2}|5(?:[0-2][0-9]|3[0-5])))

I would argue that this makes the regex even less readable than it already was. Verbose mode makes the differences a bit clearer. Compare

6553[0-5]       |
655[0-2][0-9]   |
65[0-4][0-9]{2} |
6[0-4][0-9]{3}  

with

6 
(?:
 [0-4][0-9]{3}
|
 5
 (?:
  [0-4][0-9]{2}
 |
  5
  (?:
   [0-2][0-9]
  |
   3[0-5]
  )
 )
)
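
If you want to convince yourself that the two spellings are equivalent, a brute-force comparison is easy to write. The following is a sketch using Python's re.VERBOSE flag, which ignores whitespace inside the pattern; FLAT, NESTED and same_matches are names chosen for this example.

```python
import re

# Flat version: the leading digits are repeated in every alternative.
FLAT = re.compile(r"""
    6553[0-5]       |
    655[0-2][0-9]   |
    65[0-4][0-9]{2} |
    6[0-4][0-9]{3}
""", re.VERBOSE)

# Nested version: each shared leading digit is factored out once.
NESTED = re.compile(r"""
    6
    (?:
      [0-4][0-9]{3}
    |
      5
      (?:
        [0-4][0-9]{2}
      |
        5
        (?:
          [0-2][0-9]
        |
          3[0-5]
        )
      )
    )
""", re.VERBOSE)

def same_matches(lo=0, hi=100000):
    """Return True if both patterns agree on every integer in [lo, hi)."""
    return all(
        bool(FLAT.fullmatch(str(i))) == bool(NESTED.fullmatch(str(i)))
        for i in range(lo, hi)
    )

if __name__ == "__main__":
    print(same_matches())  # True
```

Both patterns cover exactly the range 60000 through 65535; the remaining alternatives of the full regex handle everything below that.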

Some performance measurements: Testing each regex against all numbers from 1 through 99999 shows a minimal, probably irrelevant performance benefit for the nested version:

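import timeit

# Setup for the flat version. No leading ^ is needed because
# regex.match() only matches at the start of the string.
r1 = """import re
regex = re.compile(r"0*(?:6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}|[1-5][0-9]{4}|[1-9][0-9]{1,3}|[0-9])$")"""

# Setup for the nested version.
r2 = """import re
regex = re.compile(r"0*(?:6(?:[0-4][0-9]{3}|5(?:[0-4][0-9]{2}|5(?:[0-2][0-9]|3[0-5])))|[1-5][0-9]{4}|[1-9][0-9]{1,3}|[0-9])$")"""

# Statement under test: match every number from 1 through 99999.
stmt = """for i in range(1,100000):
    regex.match(str(i))"""

# Run each variant 100 times and print the total time in seconds.
print(timeit.timeit(setup=r1, stmt=stmt, number=100))
print(timeit.timeit(setup=r2, stmt=stmt, number=100))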

Output:

7.7265428834649
7.556472630353351
Tim Pietzcker answered Sep 28 '22 15:09


Personally I would match just a number and then I would check with code that the number is in range.
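
A minimal sketch of that approach in Python might look like this. The function name parse_port and the five-digit cap are assumptions made for the example; the cap keeps the integer conversion cheap but also rejects longer zero-padded inputs such as "000080".

```python
import re

# [0-9] rather than \d, because \d also matches non-ASCII digits in Python str patterns.
DIGITS = re.compile(r"[0-9]{1,5}")

def parse_port(text):
    """Return the port as an int, or None if text is not a port in 0..65535."""
    if DIGITS.fullmatch(text) is None:
        return None          # not a run of one to five ASCII digits
    value = int(text)
    return value if value <= 65535 else None  # range check done in code, not in the regex

if __name__ == "__main__":
    print(parse_port("8080"))   # 8080
    print(parse_port("65536"))  # None
    print(parse_port("http"))   # None
```

The regex stays trivial to read, and the upper bound lives in one obvious place instead of being encoded digit by digit.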

gpeche answered Sep 28 '22 16:09