I made a regex for port numbers (before you say this is a bad idea, it's going into a bigger regex for URLs, which is much harder than it sounds).
My coworker said this is really bad and isn't going to catch everything. I disagree.
I believe this thing catches everything from 0 to 65535 and nothing else, and I'm looking for confirmation of this.
Single-line version (for computers):
/(^[0-9]$)|(^[0-9][0-9]$)|(^[0-9][0-9][0-9]$)|(^[0-9][0-9][0-9][0-9]$)|((^[0-5][0-9][0-9][0-9][0-9]$)|(^6[0-4][0-9][0-9][0-9]$)|(^65[0-4][0-9][0-9]$)|(^655[0-2][0-9]$)|(^6553[0-5]$))/
Human readable version:
/(^[0-9]$)| # single digit
(^[0-9][0-9]$)| # two digit
(^[0-9][0-9][0-9]$)| # three digit
(^[0-9][0-9][0-9][0-9]$)| # four digit
((^[0-5][0-9][0-9][0-9][0-9]$)| # five digit (up to 59999)
(^6[0-4][0-9][0-9][0-9]$)| # (up to 64999)
(^65[0-4][0-9][0-9]$)| # (up to 65499)
(^655[0-2][0-9]$)| # (up to 65529)
(^6553[0-5]$))/ # (up to 65535)
Can someone confirm that my understanding is correct (or otherwise)?
A port number is a way to identify a specific process to which an internet or other network message is to be forwarded when it arrives at a server. All network-connected devices come equipped with standardized ports that have an assigned number.
You could shorten it considerably:
^0*(?:6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}|[1-5][0-9]{4}|[1-9][0-9]{1,3}|[0-9])$
Drop the leading 0* if you don't want to allow leading zeroes.
This regex is also better because it matches the special cases (65535, 65001 etc.) first and thus avoids some backtracking.
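Whether a pattern like this accepts exactly 0 to 65535 is easy to check by brute force. Here is a minimal sketch, assuming Python's re module (PORT_RE and accepted are illustrative names, not part of the question): it runs the shortened pattern against every integer from 0 through 99999 and compares the accepted values with the expected range.

```python
import re

# The shortened pattern from above, leading 0* included.
PORT_RE = re.compile(
    r"^0*(?:6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}"
    r"|[1-5][0-9]{4}|[1-9][0-9]{1,3}|[0-9])$"
)

# Every integer in 0..99999 whose decimal form the pattern accepts.
accepted = [n for n in range(100000) if PORT_RE.match(str(n))]

# True if the accepted set is exactly 0..65535.
print(accepted == list(range(65536)))

# Leading zeroes are tolerated because of the 0* prefix; 65536 is rejected.
print(bool(PORT_RE.match("0080")), bool(PORT_RE.match("65536")))
```

This only covers canonical decimal strings plus the two spot checks, so it says nothing about inputs such as whitespace or non-ASCII digits.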
Oh, and since you said you want to use this as part of a larger regex for URLs, you should then replace both ^ and $ with \b (word boundary anchors).
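To illustrate the \b variant inside a larger pattern, here is a rough sketch in Python. The surrounding host:port pattern (HOST_PORT) and the helper find_port are made up for the example and are far simpler than a real URL grammar; only the port alternation comes from the regex above.

```python
import re

# Same alternation as above, with \b instead of ^ and $ and without the
# leading 0*, so it can be embedded in a larger pattern.
PORT = (
    r"\b(?:6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}"
    r"|[1-5][0-9]{4}|[1-9][0-9]{1,3}|[0-9])\b"
)

# Hypothetical, deliberately simplistic host:port pattern for illustration.
HOST_PORT = re.compile(r"(?P<host>[A-Za-z0-9.-]+):(?P<port>" + PORT + r")")

def find_port(text):
    """Return the port substring if text contains host:port, else None."""
    m = HOST_PORT.search(text)
    return m.group("port") if m else None

print(find_port("http://example.com:8080/index.html"))
print(find_port("http://example.com:65536/"))
```

The second call finds no port: after 6553 or 65535-style prefixes there is no word boundary in the middle of the digit run, so the pattern cannot match just part of 65536.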
Edit: @ceving asked if the repetition of 6553, 655, 65 and 6 is really necessary. The answer is no - you can also use a nested regex instead of having to repeat those leading digits. Let's just consider the section
6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}
This can be rewritten as
6(?:[0-4][0-9]{3}|5(?:[0-4][0-9]{2}|5(?:[0-2][0-9]|3[0-5])))
I would argue that this makes the regex even less readable than it already was. Verbose mode makes the differences a bit clearer. Compare
6553[0-5] |
655[0-2][0-9] |
65[0-4][0-9]{2} |
6[0-4][0-9]{3}
with
6
(?:
  [0-4][0-9]{3}
  |
  5
  (?:
    [0-4][0-9]{2}
    |
    5
    (?:
      [0-2][0-9]
      |
      3[0-5]
    )
  )
)
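In Python, verbose mode is the re.VERBOSE flag, which ignores whitespace in the pattern and allows # comments. As a sketch (FLAT, NESTED and mismatches are illustrative names, and the ^ and $ anchors are added here only so the fragment can be tested on its own), the two forms can be compared over every number up to 99999:

```python
import re

# Flat form of the 60000-65535 section.
FLAT = re.compile(
    r"^(?:6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3})$"
)

# Nested form, written in verbose mode with the range each branch covers.
NESTED = re.compile(r"""
    ^6
    (?:
        [0-4][0-9]{3}          # 60000-64999
      |
        5
        (?:
            [0-4][0-9]{2}      # 65000-65499
          |
            5
            (?:
                [0-2][0-9]     # 65500-65529
              |
                3[0-5]         # 65530-65535
            )
        )
    )$
""", re.VERBOSE)

# Numbers on which the two patterns disagree; an empty list means they agree.
mismatches = [n for n in range(100000)
              if bool(FLAT.match(str(n))) != bool(NESTED.match(str(n)))]
print(mismatches)
```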
Some performance measurements: Testing each regex against all numbers from 1 through 99999 shows a minimal, probably irrelevant performance benefit for the nested version:
import timeit
r1 = """import re
regex = re.compile(r"0*(?:6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}|[1-5][0-9]{4}|[1-9][0-9]{1,3}|[0-9])$")"""
r2 = """import re
regex = re.compile(r"0*(?:6(?:[0-4][0-9]{3}|5(?:[0-4][0-9]{2}|5(?:[0-2][0-9]|3[0-5])))|[1-5][0-9]{4}|[1-9][0-9]{1,3}|[0-9])$")"""
stmt = """for i in range(1,100000):
    regex.match(str(i))"""
print(timeit.timeit(setup=r1, stmt=stmt, number=100))
print(timeit.timeit(setup=r2, stmt=stmt, number=100))
Output:
7.7265428834649
7.556472630353351
Personally I would match just a number and then I would check with code that the number is in range.
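A minimal sketch of that approach in Python, assuming the port arrives as a standalone string (DIGITS and is_valid_port are hypothetical names): the regex only checks that the input is one to five ASCII digits, and the range check is ordinary integer comparison.

```python
import re

# One to five ASCII digits; leading zeroes such as "00080" are allowed here.
DIGITS = re.compile(r"[0-9]{1,5}")

def is_valid_port(text):
    """Return True if text is 1-5 ASCII digits with a value in 0..65535."""
    if not DIGITS.fullmatch(text):
        return False
    return 0 <= int(text) <= 65535

print(is_valid_port("8080"), is_valid_port("65536"), is_valid_port("80a"))
```

When the port is part of a larger URL regex, the same idea applies: capture the digits in a group and validate the captured value afterwards.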