Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Generalization for regular expression on any list

Tags:

regex

I'm working with a large list containing integers and I would like to do some pattern matching on them (like finding certain sequences). Regular expressions would be the perfect fit, except that they always seem to only handle lists of characters, a.k.a. strings. Is there any library (in any language) that can handle lists of an arbitrary type?

I'm aware that I could convert my integer list into a string and then do a normal regular expression search but that seems a bit wasteful and inelegant.

edit:

My requirements are fairly simple. No need for nested lists, no need for fancy character classes. Basically I'm just interested in occurrences of sequences that can get pretty complicated. (e.g. something like "[abc]{3,}.([ab]?[^a]{4,7})" etc. where a,b,c are integers). This should be possible to generalize over any type which can be checked for equality. For an enumerable type you could also get things like "[a-z]" to work.

like image 666
pafcu Avatar asked Nov 29 '10 11:11

pafcu


People also ask

What is generalized regular expression?

Regular expressions are a generalized way to match patterns with sequences of characters. It is used in every programming language like C++, Java and Python.

Which are 3 uses of regular expression?

Regular expressions are useful in any scenario that benefits from full or partial pattern matches on strings. These are some common use cases: verify the structure of strings. extract substrings form structured strings.

Why * is used in regex?

* - means "0 or more instances of the preceding regex token"

What will the \$ regular expression match?

\$ will help to find the character "$" available in the content based on the expression flags assigned to the regular expression.


2 Answers

Regular expressions match only strings, by definition.

Of course, in theory you could construct an equivalent grammar, say for lists of numbers. With new tokens like \e for even numbers, \o for odd numbers, \s for square numbers, \r for real numbers etc., so that

[1, 2, 3, 4, 5, 6]

would be matched by

^(\o\e)*$

or

[ln(3), math.pi, sqrt(-1)]

would be matched by

^\R*$

etc. Sounds like a fun project, but also like a very difficult one. And how this could be expanded to handle arbitrary lists, nested and all, is beyond me.

like image 72
Tim Pietzcker Avatar answered Oct 17 '22 19:10

Tim Pietzcker


Some of the regex syntax generalize to generic sequences. Also, to be able to specify any object, strings is not the best medium for the expression themselves.

"Small" example in python:

def choice(*items):
  return ('choice',[value(item) for item in items])

def seq(*items):
  return ('seq',[value(item) for item in items])

def repeat(min,max,lazy,item):
  return ('repeat',min,max,lazy,value(item))

def value(item):
  if not isinstance(item,tuple):
    return ('value',item)
  return item

def compile(pattern):
  ret = []
  key = pattern[0]
  if key == 'value':
    ret.append(('literal',pattern[1]))
  elif key == 'seq':
    for subpattern in pattern[1]:
      ret.extend(compile(subpattern))
  elif key == 'choice':
    jumps = []
    n = len(pattern[1])
    for i,subpattern in enumerate(pattern[1]):
      if i < n-1:
        pos = len(ret)
        ret.append('placeholder for choice')
        ret.extend(compile(subpattern))
        jumps.append(len(ret))
        ret.append('placeholder for jump')
        ret[pos] = ('choice',len(ret)-pos)
      else:
        ret.extend(compile(subpattern))
    for pos in jumps:
      ret[pos] = ('jump', len(ret)-pos)
  elif key == 'repeat':
    min,max,lazy,subpattern = pattern[1:]
    for _ in xrange(min):
      ret.extend(compile(subpattern))
    if max == -1:
      if lazy:
        pos = len(ret)
        ret.append('placeholder for jump')
        ret.extend(compile(subpattern))
        ret[pos] = ('jump',len(ret)-pos)
        ret.append(('choice',pos+1-len(ret)))
      else:
        pos = len(ret)
        ret.append('placeholder for choice')
        ret.extend(compile(subpattern))
        ret.append(('jump',pos-len(ret)))
        ret[pos] = ('choice',len(ret)-pos)
    elif max > min:
      if lazy:
        jumps = []
        for _ in xrange(min,max):
          ret.append(('choice',2))
          jumps.append(len(ret))
          ret.append('placeholder for jump')
          ret.extend(compile(subpattern))
        for pos in jumps:
          ret[pos] = ('jump', len(ret)-pos)
      else:
        choices = []
        for _ in xrange(min,max):
          choices.append(len(ret))
          ret.append('placeholder for choice')
          ret.extend(compile(subpattern))
          ret.append(('drop,'))
        for pos in choices:
          ret[pos] = ('choice',len(ret)-pos)
  return ret

def match(pattern,subject,start=0):
  stack = []
  pos = start
  i = 0
  while i < len(pattern):
    instruction = pattern[i]
    key = instruction[0]
    if key == 'literal':
      if pos < len(subject) and subject[pos] == instruction[1]:
        i += 1
        pos += 1
        continue
    elif key == 'jump':
      i += instruction[1]
      continue
    elif key == 'choice':
      stack.append((i+instruction[1],pos))
      i += 1
      continue
    # fail
    if not stack:
      return None
    i,pos = stack.pop()
  return pos

def find(pattern,subject,start=0):
  for pos1 in xrange(start,len(subject)+1):
    pos2 = match(pattern,subject,pos1)
    if pos2 is not None: return pos1,pos2
  return None,None

def find_all(pattern,subject,start=0):
  matches = []
  pos1,pos2 = find(pattern,subject,start)
  while pos1 is not None:
    matches.append((pos1,pos2))
    pos1,pos2 = find(pattern,subject,pos2)
  return matches

# Timestamps: ([01][0-9]|2[0-3])[0-5][0-9]
pattern = compile(
  seq(
    choice(
      seq(choice(0,1),choice(0,1,2,3,4,5,6,7,8,9)),
      seq(2,choice(0,1,2,3)),
    ),
    choice(0,1,2,3,4,5),
    choice(0,1,2,3,4,5,6,7,8,9),
  )
)

print find_all(pattern,[1,3,2,5,6,3,4,2,4,3,2,2,3,6,6,5,3,5,3,3,2,5,4,5])
# matches: (0,4): [1,3,2,5]; (10,14): [2,2,3,6]

A few points of improvement:

  • More constructs: classes with negation, ranges
  • Classes instead of tuples
like image 1
Markus Jarderot Avatar answered Oct 17 '22 19:10

Markus Jarderot