I'm working with a large list containing integers and I would like to do some pattern matching on them (like finding certain sequences). Regular expressions would be the perfect fit, except that they always seem to only handle lists of characters, a.k.a. strings. Is there any library (in any language) that can handle lists of an arbitrary type? I'm aware that I could convert my integer list into a string and then do a normal regular expression search but that seems a bit wasteful and inelegant. edit: My requirements are fairly simple. No need for nested lists, no need for fancy character classes. Basically I'm just interested in occurrences of sequences that can get pretty complicated. (e.g. something like <code>"[abc]{3,}.([ab]?[^a]{4,7})"</code> etc. where <code>a</code>,<code>b</code>,<code>c</code> are integers). This should be possible to generalize over any type which can be checked for equality. For an enumerable type you could also get things like <code>"[a-z]"</code> to work.

Regular expressions match only strings, by definition. Of course, in theory you could construct an equivalent grammar, say for lists of numbers. With new tokens like <code>\e</code> for even numbers, <code>\o</code> for odd numbers, <code>\s</code> for square numbers, <code>\r</code> for real numbers etc., so that <pre class="prettyprint"><code>[1, 2, 3, 4, 5, 6] </code></pre> would be matched by <pre class="prettyprint"><code>^(\o\e)*$ </code></pre> or <pre class="prettyprint"><code>[ln(3), math.pi, sqrt(-1)] </code></pre> would be matched by <pre class="prettyprint"><code>^\R*$ </code></pre> etc. Sounds like a fun project, but also like a very difficult one. And how this could be expanded to handle arbitrary lists, nested and all, is beyond me.

Generalization for regular expression on any list

Tags:

regex

I'm working with a large list containing integers and I would like to do some pattern matching on them (like finding certain sequences). Regular expressions would be the perfect fit, except that they always seem to only handle lists of characters, a.k.a. strings. Is there any library (in any language) that can handle lists of an arbitrary type?

I'm aware that I could convert my integer list into a string and then do a normal regular expression search but that seems a bit wasteful and inelegant.

edit:

My requirements are fairly simple. No need for nested lists, no need for fancy character classes. Basically I'm just interested in occurrences of sequences that can get pretty complicated. (e.g. something like "[abc]{3,}.([ab]?[^a]{4,7})" etc. where a,b,c are integers). This should be possible to generalize over any type which can be checked for equality. For an enumerable type you could also get things like "[a-z]" to work.

666

asked Nov 29 '10 11:11

pafcu

2 Answers

Regular expressions match only strings, by definition.

Of course, in theory you could construct an equivalent grammar, say for lists of numbers. With new tokens like \e for even numbers, \o for odd numbers, \s for square numbers, \r for real numbers etc., so that

[1, 2, 3, 4, 5, 6]

would be matched by

^(\o\e)*$

[ln(3), math.pi, sqrt(-1)]

would be matched by

^\R*$

etc. Sounds like a fun project, but also like a very difficult one. And how this could be expanded to handle arbitrary lists, nested and all, is beyond me.

answered Oct 17 '22 19:10

Tim Pietzcker

Some of the regex syntax generalize to generic sequences. Also, to be able to specify any object, strings is not the best medium for the expression themselves.

"Small" example in python:

def choice(*items):
  return ('choice',[value(item) for item in items])

def seq(*items):
  return ('seq',[value(item) for item in items])

def repeat(min,max,lazy,item):
  return ('repeat',min,max,lazy,value(item))

def value(item):
  if not isinstance(item,tuple):
    return ('value',item)
  return item

def compile(pattern):
  ret = []
  key = pattern[0]
  if key == 'value':
    ret.append(('literal',pattern[1]))
  elif key == 'seq':
    for subpattern in pattern[1]:
      ret.extend(compile(subpattern))
  elif key == 'choice':
    jumps = []
    n = len(pattern[1])
    for i,subpattern in enumerate(pattern[1]):
      if i < n-1:
        pos = len(ret)
        ret.append('placeholder for choice')
        ret.extend(compile(subpattern))
        jumps.append(len(ret))
        ret.append('placeholder for jump')
        ret[pos] = ('choice',len(ret)-pos)
      else:
        ret.extend(compile(subpattern))
    for pos in jumps:
      ret[pos] = ('jump', len(ret)-pos)
  elif key == 'repeat':
    min,max,lazy,subpattern = pattern[1:]
    for _ in xrange(min):
      ret.extend(compile(subpattern))
    if max == -1:
      if lazy:
        pos = len(ret)
        ret.append('placeholder for jump')
        ret.extend(compile(subpattern))
        ret[pos] = ('jump',len(ret)-pos)
        ret.append(('choice',pos+1-len(ret)))
      else:
        pos = len(ret)
        ret.append('placeholder for choice')
        ret.extend(compile(subpattern))
        ret.append(('jump',pos-len(ret)))
        ret[pos] = ('choice',len(ret)-pos)
    elif max > min:
      if lazy:
        jumps = []
        for _ in xrange(min,max):
          ret.append(('choice',2))
          jumps.append(len(ret))
          ret.append('placeholder for jump')
          ret.extend(compile(subpattern))
        for pos in jumps:
          ret[pos] = ('jump', len(ret)-pos)
      else:
        choices = []
        for _ in xrange(min,max):
          choices.append(len(ret))
          ret.append('placeholder for choice')
          ret.extend(compile(subpattern))
          ret.append(('drop,'))
        for pos in choices:
          ret[pos] = ('choice',len(ret)-pos)
  return ret

def match(pattern,subject,start=0):
  stack = []
  pos = start
  i = 0
  while i < len(pattern):
    instruction = pattern[i]
    key = instruction[0]
    if key == 'literal':
      if pos < len(subject) and subject[pos] == instruction[1]:
        i += 1
        pos += 1
        continue
    elif key == 'jump':
      i += instruction[1]
      continue
    elif key == 'choice':
      stack.append((i+instruction[1],pos))
      i += 1
      continue
    # fail
    if not stack:
      return None
    i,pos = stack.pop()
  return pos

def find(pattern,subject,start=0):
  for pos1 in xrange(start,len(subject)+1):
    pos2 = match(pattern,subject,pos1)
    if pos2 is not None: return pos1,pos2
  return None,None

def find_all(pattern,subject,start=0):
  matches = []
  pos1,pos2 = find(pattern,subject,start)
  while pos1 is not None:
    matches.append((pos1,pos2))
    pos1,pos2 = find(pattern,subject,pos2)
  return matches

# Timestamps: ([01][0-9]|2[0-3])[0-5][0-9]
pattern = compile(
  seq(
    choice(
      seq(choice(0,1),choice(0,1,2,3,4,5,6,7,8,9)),
      seq(2,choice(0,1,2,3)),
    ),
    choice(0,1,2,3,4,5),
    choice(0,1,2,3,4,5,6,7,8,9),
  )
)

print find_all(pattern,[1,3,2,5,6,3,4,2,4,3,2,2,3,6,6,5,3,5,3,3,2,5,4,5])
# matches: (0,4): [1,3,2,5]; (10,14): [2,2,3,6]

A few points of improvement:

More constructs: classes with negation, ranges
Classes instead of tuples

answered Oct 17 '22 19:10

Markus Jarderot

Related questions
                            
                                How can I remove non-numeric characters from strings using gsub in R?
                            
                                PHP and regexp to accept only Greek characters in form
                            
                                Python string match
                            
                                How do I retrieve an integer's ordinal suffix in Perl (like st, nd, rd, th)
                            
                                jQuery function to to format number with commas and decimal
                            
                                Regular Expressions for file name matching
                            
                                Removing Dollar and comma from string
                            
                                Split by '/' till '[' appears
                            
                                Regex matching numbers and decimals
                            
                                Regular Expression to extract php code partially (( array definition ))
                            
                                Remove all html tags from attributes in rails
                            
                                Replace the last occurrence of a string in another string [duplicate]
                            
                                Split string by char in java
                            
                                how to remove brackets character in string (java)
                            
                                Java - How to test if a String contains both letters and numbers
                            
                                Regular expression to validate hex string
                            
                                How can I delete all /* */ comments from a C source file?
                            
                                Why this regex is not working for german words?
                            
                                How can I escape meta-characters when I interpolate a variable in Perl's match operator?
                            
                                Is it possible for Perl to preserve case in a substitution? [duplicate]

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With