This may be a problem that has already been solved, but I can't figure it out. I have two largish integers, lets call them start_number
and end_number
(they represent a contiguous block of telephone numbers). Other numbers (represented as strings) will be input into my system and I need to use regex to match this against the 'range regex' to see if the number string falls either on or between start_number
and end_number
.
For example:
start_number = 99519000
end_number = 99519099
therefore
expression = "^995190[0-9][0-9]$"
so that I can eventually match the following example numbers (which arrive at my system one at a time, and could arrive at any time):
"99519000"
<- MATCH"99519055"
<- MATCH"99519099"
<- MATCH"99519100"
<- NOT MATCH"99512210"
<- NOT MATCH"41234123"
<- NOT MATCHHow can I use python to create the regex string pattern "expression
" given any reasonable start_number
and end_number
? I have several start/end number 'blocks' I have to create regex patterns for, I just need a way to make these patterns programatically.
It's fair to assume that:
Start_number
will always be less than end_number
start_number
and end_number
will always be the same 'length' (i.e. always will have the same number of base 10 'characters' when represented as a string), if it'll make life easier.Edit: for clarity
[assuming you need this because it's some weird 3rd party system that requires regexp]
New Approach
the more i think about Frederik's comment, the more i agree. the regexp engine should be able to compile this down to a compact DFA, even if the input string is long. for many cases, the following is a sensible solution:
import re
def regexp(lo, hi):
fmt = '%%0%dd' % len(str(hi))
return re.compile('(%s)' % '|'.join(fmt % i for i in range(lo, hi+1)))
(it works fine with all the numerical ranges in the tests below, including 99519000 - 99519099. a rough back-of-the-envelope calculation suggests that 9 digit numbers are about the limit with 1GB of memory. that's if most numbers that size are matched; if only a few are matched you can go much larger.).
Old Approach
[updated again to give even shorter results - apart from coalescing the occasional \d\d
it's about as good as hand-generated]
assuming all numbers are the same length (ie you zero pad on the left if necessary), this works:
import re
def alt(*args):
'''format regexp alternatives'''
if len(args) == 1: return args[0]
else: return '(%s)' % '|'.join(args)
def replace(s, c):
'''replace all characters in a string with a different character'''
return ''.join(map(lambda x: c, s))
def repeat(s, n):
'''format a regexp repeat'''
if n == 0: return ''
elif n == 1: return s
else: return '%s{%d}' % (s, n)
def digits(lo, hi):
'''format a regexp digit range'''
if lo == 0 and hi == 9: return r'\d'
elif lo == hi: return str(lo)
else: return '[%d-%d]' % (lo, hi)
def trace(f):
'''for debugging'''
def wrapped(lo, hi):
result = f(lo, hi)
print(lo, hi, result)
return result
return wrapped
#@trace # uncomment to get calls traced to stdout (explains recursion when bug hunting)
def regexp(lo, hi):
'''generate a regexp that matches integers from lo to hi only.
assumes that inputs are zero-padded to the length of hi (like phone numbers).
you probably want to surround with ^ and $ before using.'''
assert lo <= hi
assert lo >= 0
slo, shi = str(lo), str(hi)
# zero-pad to same length
while len(slo) < len(shi): slo = '0' + slo
# first digits and length
l, h, n = int(slo[0]), int(shi[0]), len(slo)
if l == h:
# extract common prefix
common = ''
while slo and slo[0] == shi[0]:
common += slo[0]
slo, shi = slo[1:], shi[1:]
if slo: return common + regexp(int(slo), int(shi))
else: return common
else:
# the core of the routine.
# split into 'complete blocks' like 200-599 and 'edge cases' like 123-199
# and handle each separately.
# are these complete blocks?
xlo = slo[1:] == replace(slo[1:], '0')
xhi = shi[1:] == replace(shi[1:], '9')
# edges of possible complete blocks
mlo = int(slo[0] + replace(slo[1:], '9'))
mhi = int(shi[0] + replace(shi[1:], '0'))
if xlo:
if xhi:
# complete block on both sides
# this is where single digits are finally handled, too.
return digits(l, h) + repeat('\d', n-1)
else:
# complete block to mhi, plus extra on hi side
prefix = '' if l or h-1 else '0'
return alt(prefix + regexp(lo, mhi-1), regexp(mhi, hi))
else:
prefix = '' if l else '0'
if xhi:
# complete block on hi side plus extra on lo
return alt(prefix + regexp(lo, mlo), regexp(mlo+1, hi))
else:
# neither side complete, so add extra on both sides
# (and maybe a complete block in the middle, if room)
if mlo + 1 == mhi:
return alt(prefix + regexp(lo, mlo), regexp(mhi, hi))
else:
return alt(prefix + regexp(lo, mlo), regexp(mlo+1, mhi-1), regexp(mhi, hi))
# test a bunch of different ranges
for (lo, hi) in [(0, 0), (0, 1), (0, 2), (0, 9), (0, 10), (0, 11), (0, 101),
(1, 1), (1, 2), (1, 9), (1, 10), (1, 11), (1, 101),
(0, 123), (111, 123), (123, 222), (123, 333), (123, 444),
(0, 321), (111, 321), (222, 321), (321, 333), (321, 444),
(123, 321), (111, 121), (121, 222), (1234, 4321), (0, 999),
(99519000, 99519099)]:
fmt = '%%0%dd' % len(str(hi))
rx = regexp(lo, hi)
print('%4s - %-4s %s' % (fmt % lo, fmt % hi, rx))
m = re.compile('^%s$' % rx)
for i in range(0, 1+int(replace(str(hi), '9'))):
if m.match(fmt % i):
assert lo <= i <= hi, i
else:
assert i < lo or i > hi, i
the function regexp(lo, hi)
builds a regexp that matches values between lo
and hi
(zero padded to the maximum length). you probably need to put a ^
before and a $
after (as in the test code) to force the match to be the whole string.
the algorithm is actually quite simple - it recursively divides things into common prefixes and "complete blocks". a complete block is something like 200-599 and can be matched reliably (in this case with [2-5]\d{2}
).
so 123-599 is split into 123-199 and 200-599. the last half is a complete block, the first half has a common prefix of 1 and 23-99, which is recursively handled as 23-29 (common prefix) and 30-99 (complete block) (and we terminate eventually, because arguments to each call are shorter than the initial input).
the only nasty detail is the prefix
, which is needed because the arguments to regexp()
are integers, so when called to generate, say, the regexp for 00-09 it actually generates the regexp for 0-9, without the leading 0.
the output is a bunch of test cases, showing the range and regexp:
0 - 0 0
0 - 1 [0-1]
0 - 2 [0-2]
0 - 9 \d
00 - 10 (0\d|10)
00 - 11 (0\d|1[0-1])
000 - 101 (0\d\d|10[0-1])
1 - 1 1
1 - 2 [1-2]
1 - 9 [1-9]
01 - 10 (0[1-9]|10)
01 - 11 (0[1-9]|1[0-1])
001 - 101 (0(0[1-9]|[1-9]\d)|10[0-1])
000 - 123 (0\d\d|1([0-1]\d|2[0-3]))
111 - 123 1(1[1-9]|2[0-3])
123 - 222 (1(2[3-9]|[3-9]\d)|2([0-1]\d|2[0-2]))
123 - 333 (1(2[3-9]|[3-9]\d)|2\d\d|3([0-2]\d|3[0-3]))
123 - 444 (1(2[3-9]|[3-9]\d)|[2-3]\d{2}|4([0-3]\d|4[0-4]))
000 - 321 ([0-2]\d{2}|3([0-1]\d|2[0-1]))
111 - 321 (1(1[1-9]|[2-9]\d)|2\d\d|3([0-1]\d|2[0-1]))
222 - 321 (2(2[2-9]|[3-9]\d)|3([0-1]\d|2[0-1]))
321 - 333 3(2[1-9]|3[0-3])
321 - 444 (3(2[1-9]|[3-9]\d)|4([0-3]\d|4[0-4]))
123 - 321 (1(2[3-9]|[3-9]\d)|2\d\d|3([0-1]\d|2[0-1]))
111 - 121 1(1[1-9]|2[0-1])
121 - 222 (1(2[1-9]|[3-9]\d)|2([0-1]\d|2[0-2]))
1234 - 4321 (1(2(3[4-9]|[4-9]\d)|[3-9]\d{2})|[2-3]\d{3}|4([0-2]\d{2}|3([0-1]\d|2[0-1])))
000 - 999 \d\d{2}
99519000 - 99519099 995190\d\d
it takes a while to run as the last test loops over 99999999 numbers.
the expressions should be compact enough to avoid any buffer limits (i would guess that memory size worst case is proportional to the square of the number of digits in the largest number).
ps i am using python 3, but i don't think it makes much difference here.
You can install this package by using pip
pip install regex-engine
from regex_engine import generator
generate = generator()
regex = generate.numerical_range(99519000, 99519099)
print(regex)
^(995190[1-8][0-9]|9951900[0-9]|9951909[0-9])$
You can also generate regexes for floating point and negative ranges
from regex_engine import generator
generate = generator()
regex1 = generate.numerical_range(5,89)
regex2 = generate.numerical_range(81.78,250.23)
regex3 = generate.numerical_range(-65,12)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With