I'm working with a large list containing integers and I would like to do some pattern matching on them (like finding certain sequences). Regular expressions would be the perfect fit, except that they always seem to only handle lists of characters, a.k.a. strings. Is there any library (in any language) that can handle lists of an arbitrary type?
I'm aware that I could convert my integer list into a string and then do a normal regular expression search but that seems a bit wasteful and inelegant.
edit:
My requirements are fairly simple. No need for nested lists, no need for fancy character classes. Basically I'm just interested in occurrences of sequences that can get pretty complicated. (e.g. something like "[abc]{3,}.([ab]?[^a]{4,7})"
etc. where a
,b
,c
are integers). This should be possible to generalize over any type which can be checked for equality. For an enumerable type you could also get things like "[a-z]"
to work.
Regular expressions are a generalized way to match patterns with sequences of characters. It is used in every programming language like C++, Java and Python.
Regular expressions are useful in any scenario that benefits from full or partial pattern matches on strings. These are some common use cases: verify the structure of strings. extract substrings form structured strings.
* - means "0 or more instances of the preceding regex token"
\$ will help to find the character "$" available in the content based on the expression flags assigned to the regular expression.
Regular expressions match only strings, by definition.
Of course, in theory you could construct an equivalent grammar, say for lists of numbers. With new tokens like \e
for even numbers, \o
for odd numbers, \s
for square numbers, \r
for real numbers etc., so that
[1, 2, 3, 4, 5, 6]
would be matched by
^(\o\e)*$
or
[ln(3), math.pi, sqrt(-1)]
would be matched by
^\R*$
etc. Sounds like a fun project, but also like a very difficult one. And how this could be expanded to handle arbitrary lists, nested and all, is beyond me.
Some of the regex syntax generalize to generic sequences. Also, to be able to specify any object, strings is not the best medium for the expression themselves.
"Small" example in python:
def choice(*items):
return ('choice',[value(item) for item in items])
def seq(*items):
return ('seq',[value(item) for item in items])
def repeat(min,max,lazy,item):
return ('repeat',min,max,lazy,value(item))
def value(item):
if not isinstance(item,tuple):
return ('value',item)
return item
def compile(pattern):
ret = []
key = pattern[0]
if key == 'value':
ret.append(('literal',pattern[1]))
elif key == 'seq':
for subpattern in pattern[1]:
ret.extend(compile(subpattern))
elif key == 'choice':
jumps = []
n = len(pattern[1])
for i,subpattern in enumerate(pattern[1]):
if i < n-1:
pos = len(ret)
ret.append('placeholder for choice')
ret.extend(compile(subpattern))
jumps.append(len(ret))
ret.append('placeholder for jump')
ret[pos] = ('choice',len(ret)-pos)
else:
ret.extend(compile(subpattern))
for pos in jumps:
ret[pos] = ('jump', len(ret)-pos)
elif key == 'repeat':
min,max,lazy,subpattern = pattern[1:]
for _ in xrange(min):
ret.extend(compile(subpattern))
if max == -1:
if lazy:
pos = len(ret)
ret.append('placeholder for jump')
ret.extend(compile(subpattern))
ret[pos] = ('jump',len(ret)-pos)
ret.append(('choice',pos+1-len(ret)))
else:
pos = len(ret)
ret.append('placeholder for choice')
ret.extend(compile(subpattern))
ret.append(('jump',pos-len(ret)))
ret[pos] = ('choice',len(ret)-pos)
elif max > min:
if lazy:
jumps = []
for _ in xrange(min,max):
ret.append(('choice',2))
jumps.append(len(ret))
ret.append('placeholder for jump')
ret.extend(compile(subpattern))
for pos in jumps:
ret[pos] = ('jump', len(ret)-pos)
else:
choices = []
for _ in xrange(min,max):
choices.append(len(ret))
ret.append('placeholder for choice')
ret.extend(compile(subpattern))
ret.append(('drop,'))
for pos in choices:
ret[pos] = ('choice',len(ret)-pos)
return ret
def match(pattern,subject,start=0):
stack = []
pos = start
i = 0
while i < len(pattern):
instruction = pattern[i]
key = instruction[0]
if key == 'literal':
if pos < len(subject) and subject[pos] == instruction[1]:
i += 1
pos += 1
continue
elif key == 'jump':
i += instruction[1]
continue
elif key == 'choice':
stack.append((i+instruction[1],pos))
i += 1
continue
# fail
if not stack:
return None
i,pos = stack.pop()
return pos
def find(pattern,subject,start=0):
for pos1 in xrange(start,len(subject)+1):
pos2 = match(pattern,subject,pos1)
if pos2 is not None: return pos1,pos2
return None,None
def find_all(pattern,subject,start=0):
matches = []
pos1,pos2 = find(pattern,subject,start)
while pos1 is not None:
matches.append((pos1,pos2))
pos1,pos2 = find(pattern,subject,pos2)
return matches
# Timestamps: ([01][0-9]|2[0-3])[0-5][0-9]
pattern = compile(
seq(
choice(
seq(choice(0,1),choice(0,1,2,3,4,5,6,7,8,9)),
seq(2,choice(0,1,2,3)),
),
choice(0,1,2,3,4,5),
choice(0,1,2,3,4,5,6,7,8,9),
)
)
print find_all(pattern,[1,3,2,5,6,3,4,2,4,3,2,2,3,6,6,5,3,5,3,3,2,5,4,5])
# matches: (0,4): [1,3,2,5]; (10,14): [2,2,3,6]
A few points of improvement:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With