Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

python: number range to regex matching string

Tags:

python

regex

This may be a problem that has already been solved, but I can't figure it out. I have two largish integers, lets call them start_number and end_number (they represent a contiguous block of telephone numbers). Other numbers (represented as strings) will be input into my system and I need to use regex to match this against the 'range regex' to see if the number string falls either on or between start_number and end_number.

For example:

  • start_number = 99519000
  • end_number = 99519099

therefore

  • expression = "^995190[0-9][0-9]$"

so that I can eventually match the following example numbers (which arrive at my system one at a time, and could arrive at any time):

  • "99519000" <- MATCH
  • "99519055" <- MATCH
  • "99519099" <- MATCH
  • "99519100" <- NOT MATCH
  • "99512210" <- NOT MATCH
  • "41234123" <- NOT MATCH

How can I use python to create the regex string pattern "expression" given any reasonable start_number and end_number? I have several start/end number 'blocks' I have to create regex patterns for, I just need a way to make these patterns programatically.

It's fair to assume that:

  • Start_number will always be less than end_number
  • Always will be a positive integer.
  • In my case, start_number and end_number will always be the same 'length' (i.e. always will have the same number of base 10 'characters' when represented as a string), if it'll make life easier.

Edit: for clarity

like image 571
TC Fox Avatar asked Jul 24 '13 09:07

TC Fox


2 Answers

[assuming you need this because it's some weird 3rd party system that requires regexp]

New Approach

the more i think about Frederik's comment, the more i agree. the regexp engine should be able to compile this down to a compact DFA, even if the input string is long. for many cases, the following is a sensible solution:

import re

def regexp(lo, hi):
    fmt = '%%0%dd' % len(str(hi))
    return re.compile('(%s)' % '|'.join(fmt % i for i in range(lo, hi+1)))

(it works fine with all the numerical ranges in the tests below, including 99519000 - 99519099. a rough back-of-the-envelope calculation suggests that 9 digit numbers are about the limit with 1GB of memory. that's if most numbers that size are matched; if only a few are matched you can go much larger.).

Old Approach

[updated again to give even shorter results - apart from coalescing the occasional \d\d it's about as good as hand-generated]

assuming all numbers are the same length (ie you zero pad on the left if necessary), this works:

import re

def alt(*args):
    '''format regexp alternatives'''
    if len(args) == 1: return args[0]
    else: return '(%s)' % '|'.join(args)

def replace(s, c): 
     '''replace all characters in a string with a different character'''
    return ''.join(map(lambda x: c, s))

def repeat(s, n):
    '''format a regexp repeat'''
    if n == 0: return ''
    elif n == 1: return s
    else: return '%s{%d}' % (s, n)

def digits(lo, hi): 
    '''format a regexp digit range'''
    if lo == 0 and hi == 9: return r'\d'
    elif lo == hi: return str(lo)
    else: return '[%d-%d]' % (lo, hi)

def trace(f):
    '''for debugging'''
    def wrapped(lo, hi):
        result = f(lo, hi)
        print(lo, hi, result)
        return result
    return wrapped

#@trace  # uncomment to get calls traced to stdout (explains recursion when bug hunting)
def regexp(lo, hi):
    '''generate a regexp that matches integers from lo to hi only.
       assumes that inputs are zero-padded to the length of hi (like phone numbers).
       you probably want to surround with ^ and $ before using.'''

    assert lo <= hi
    assert lo >= 0

    slo, shi = str(lo), str(hi)
    # zero-pad to same length
    while len(slo) < len(shi): slo = '0' + slo
    # first digits and length
    l, h, n = int(slo[0]), int(shi[0]), len(slo)

    if l == h:
        # extract common prefix
        common = ''
        while slo and slo[0] == shi[0]:
            common += slo[0]
            slo, shi = slo[1:], shi[1:]
        if slo: return common + regexp(int(slo), int(shi))
        else: return common

    else:
        # the core of the routine.
        # split into 'complete blocks' like 200-599 and 'edge cases' like 123-199
        # and handle each separately.

        # are these complete blocks?
        xlo = slo[1:] == replace(slo[1:], '0')
        xhi = shi[1:] == replace(shi[1:], '9')

        # edges of possible complete blocks
        mlo = int(slo[0] + replace(slo[1:], '9'))
        mhi = int(shi[0] + replace(shi[1:], '0'))

        if xlo:
            if xhi:
                # complete block on both sides
                # this is where single digits are finally handled, too.
                return digits(l, h) + repeat('\d', n-1)
            else:
                # complete block to mhi, plus extra on hi side
                prefix = '' if l or h-1 else '0'
                return alt(prefix + regexp(lo, mhi-1), regexp(mhi, hi))
        else:
            prefix = '' if l else '0'
            if xhi:
                # complete block on hi side plus extra on lo
                return alt(prefix + regexp(lo, mlo), regexp(mlo+1, hi))
            else:
                # neither side complete, so add extra on both sides
                # (and maybe a complete block in the middle, if room)
                if mlo + 1 == mhi:
                    return alt(prefix + regexp(lo, mlo), regexp(mhi, hi))
                else:
                    return alt(prefix + regexp(lo, mlo), regexp(mlo+1, mhi-1), regexp(mhi, hi))


# test a bunch of different ranges
for (lo, hi) in [(0, 0), (0, 1), (0, 2), (0, 9), (0, 10), (0, 11), (0, 101),
                 (1, 1), (1, 2), (1, 9), (1, 10), (1, 11), (1, 101),
                 (0, 123), (111, 123), (123, 222), (123, 333), (123, 444),
                 (0, 321), (111, 321), (222, 321), (321, 333), (321, 444),
                 (123, 321), (111, 121), (121, 222), (1234, 4321), (0, 999),
                 (99519000, 99519099)]:
    fmt = '%%0%dd' % len(str(hi))
    rx = regexp(lo, hi)
    print('%4s - %-4s  %s' % (fmt % lo, fmt % hi, rx))
    m = re.compile('^%s$' % rx)
    for i in range(0, 1+int(replace(str(hi), '9'))):
        if m.match(fmt % i):
            assert lo <= i <= hi, i
        else:
            assert i < lo or i > hi, i

the function regexp(lo, hi) builds a regexp that matches values between lo and hi (zero padded to the maximum length). you probably need to put a ^ before and a $ after (as in the test code) to force the match to be the whole string.

the algorithm is actually quite simple - it recursively divides things into common prefixes and "complete blocks". a complete block is something like 200-599 and can be matched reliably (in this case with [2-5]\d{2}).

so 123-599 is split into 123-199 and 200-599. the last half is a complete block, the first half has a common prefix of 1 and 23-99, which is recursively handled as 23-29 (common prefix) and 30-99 (complete block) (and we terminate eventually, because arguments to each call are shorter than the initial input).

the only nasty detail is the prefix, which is needed because the arguments to regexp() are integers, so when called to generate, say, the regexp for 00-09 it actually generates the regexp for 0-9, without the leading 0.

the output is a bunch of test cases, showing the range and regexp:

   0 - 0     0
   0 - 1     [0-1]
   0 - 2     [0-2]
   0 - 9     \d
  00 - 10    (0\d|10)
  00 - 11    (0\d|1[0-1])
 000 - 101   (0\d\d|10[0-1])
   1 - 1     1
   1 - 2     [1-2]
   1 - 9     [1-9]
  01 - 10    (0[1-9]|10)
  01 - 11    (0[1-9]|1[0-1])
 001 - 101   (0(0[1-9]|[1-9]\d)|10[0-1])
 000 - 123   (0\d\d|1([0-1]\d|2[0-3]))
 111 - 123   1(1[1-9]|2[0-3])
 123 - 222   (1(2[3-9]|[3-9]\d)|2([0-1]\d|2[0-2]))
 123 - 333   (1(2[3-9]|[3-9]\d)|2\d\d|3([0-2]\d|3[0-3]))
 123 - 444   (1(2[3-9]|[3-9]\d)|[2-3]\d{2}|4([0-3]\d|4[0-4]))
 000 - 321   ([0-2]\d{2}|3([0-1]\d|2[0-1]))
 111 - 321   (1(1[1-9]|[2-9]\d)|2\d\d|3([0-1]\d|2[0-1]))
 222 - 321   (2(2[2-9]|[3-9]\d)|3([0-1]\d|2[0-1]))
 321 - 333   3(2[1-9]|3[0-3])
 321 - 444   (3(2[1-9]|[3-9]\d)|4([0-3]\d|4[0-4]))
 123 - 321   (1(2[3-9]|[3-9]\d)|2\d\d|3([0-1]\d|2[0-1]))
 111 - 121   1(1[1-9]|2[0-1])
 121 - 222   (1(2[1-9]|[3-9]\d)|2([0-1]\d|2[0-2]))
1234 - 4321  (1(2(3[4-9]|[4-9]\d)|[3-9]\d{2})|[2-3]\d{3}|4([0-2]\d{2}|3([0-1]\d|2[0-1])))
 000 - 999   \d\d{2}
99519000 - 99519099  995190\d\d

it takes a while to run as the last test loops over 99999999 numbers.

the expressions should be compact enough to avoid any buffer limits (i would guess that memory size worst case is proportional to the square of the number of digits in the largest number).

ps i am using python 3, but i don't think it makes much difference here.

like image 52
andrew cooke Avatar answered Sep 20 '22 20:09

andrew cooke


Use the python package regex_engine for generate regular expressions for numerical ranges

You can install this package by using pip

pip install regex-engine

from regex_engine import generator

generate = generator()

regex = generate.numerical_range(99519000, 99519099)

print(regex)

^(995190[1-8][0-9]|9951900[0-9]|9951909[0-9])$

You can also generate regexes for floating point and negative ranges

from regex_engine import generator

generate = generator()

regex1 = generate.numerical_range(5,89)
regex2 = generate.numerical_range(81.78,250.23)
regex3 = generate.numerical_range(-65,12)
like image 38
ASHWIN RAJEEV Avatar answered Sep 16 '22 20:09

ASHWIN RAJEEV