Regular expression on a stream instead of a string?

Tags:

python

regex

Suppose you want to do a regular expression search and extract over a pipe, but the pattern may span multiple lines. How would you do it? Is there a regular expression library that works on a stream?

I would prefer to do this with a Python library, but any solution is fine, as long as it is a library and not a command-line tool.

BTW, I know how to solve my current problem; I am just looking for a general solution.

If no such library exists, why can't regular expression libraries work with streams, given that the regular expression matching algorithm never needs to scan backward?

user1733712 asked Oct 22 '12 02:10


2 Answers

If you are after a general solution, your algorithm would need to look something like:

  1. Read a chunk of the stream into a buffer.
  2. Search for the regexp in the buffer.
  3. If the pattern matches, do whatever you want with the match, discard the start of the buffer up to match.end(), and go to step 2.
  4. If the pattern does not match, extend the buffer with more data from the stream and go to step 2. Stop when the stream is exhausted.
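
The steps above could be sketched in Python roughly as follows. This is only a minimal illustration using the standard `re` module on a text stream; the function name `iter_stream_matches` and the `chunk_size` parameter are made up for the example. It adds one refinement that is not in the steps: a match that reaches the very end of the buffer is held back until more data arrives, since the next chunk might extend it. It also assumes the pattern never matches the empty string, and its results can still differ from searching the whole text at once (for example with lookahead, or when an earlier start position would only match once more data is available).

```python
import io
import re


def iter_stream_matches(stream, pattern, chunk_size=4096):
    """Yield the text of each match of `pattern` in a text stream.

    Hypothetical helper for illustration; assumes `pattern` never
    matches the empty string.
    """
    regex = re.compile(pattern)
    buffer = ""
    while True:
        # Steps 1 and 4: read a chunk and extend the buffer with it.
        chunk = stream.read(chunk_size)
        at_eof = not chunk
        buffer += chunk

        consumed = 0
        # Step 2: search the buffer.
        for match in regex.finditer(buffer):
            # Refinement: a match touching the end of the buffer might
            # grow once more data arrives, so hold it back for now.
            if match.end() == len(buffer) and not at_eof:
                break
            # Step 3: hand the match to the caller.
            yield match.group(0)
            consumed = match.end()

        # Step 3 (continued): discard the buffer up to the last match end.
        buffer = buffer[consumed:]
        if at_eof:
            return


if __name__ == "__main__":
    data = io.StringIO("id=1\nname=foo\n\nid=2\nname=bar\n")
    # A tiny chunk size forces every match to span several reads.
    for text in iter_stream_matches(data, r"id=\d+\nname=\w+", chunk_size=3):
        print(repr(text))
```

Note that the buffer is only trimmed after a match, so it keeps growing for as long as nothing matches.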

This could end up using a lot of memory if no matches are found, but it is difficult to do better in the general case (consider trying to match .*x as a multi-line regexp in a large file where the only x is the last character).

If you know more about the regexp, you might have other cases where you can discard part of the buffer.
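
One such case, as a hedged sketch: if you know that no match can be longer than some number of characters, then any start position that already has that many characters after it in the buffer has been fully decided, and everything before the last `max_len - 1` characters can be dropped even when nothing matched. The function name `iter_bounded_matches` and its parameters are invented for this example. It assumes `max_len >= 1`, a pattern without lookaround or anchors such as `^`, `$` or `\b` (the trimmed buffer would look like the start of the input to them), and a pattern that never matches the empty string.

```python
import io
import re


def iter_bounded_matches(stream, pattern, max_len, chunk_size=4096):
    """Yield (absolute_offset, text) for each match in a text stream.

    Hypothetical helper for illustration; assumes no match is longer
    than `max_len` characters and the pattern uses no lookaround or
    anchors.
    """
    regex = re.compile(pattern)
    buffer = ""
    offset = 0  # position of buffer[0] within the whole stream
    while True:
        chunk = stream.read(chunk_size)
        at_eof = not chunk
        buffer += chunk

        consumed = 0
        for match in regex.finditer(buffer):
            # A match starting at s can look at most max_len characters
            # ahead, so it is final only if all of them are buffered.
            if not at_eof and match.start() + max_len > len(buffer):
                break
            yield offset + match.start(), match.group(0)
            consumed = match.end()

        if at_eof:
            return

        # Keep only the tail whose start positions are still undecided.
        keep_from = max(consumed, len(buffer) - max_len + 1)
        buffer = buffer[keep_from:]
        offset += keep_from


if __name__ == "__main__":
    data = io.StringIO("xx" * 50 + "ERR42" + "y" * 30 + "ERR7")
    for pos, text in iter_bounded_matches(data, r"ERR\d{1,3}", max_len=6, chunk_size=8):
        print(pos, text)
```

With this bound the buffer never holds more than about `chunk_size + max_len` characters, regardless of how long the stream runs without a match.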

James Henstridge answered Nov 09 '22 18:11


I solved a similar problem for searching a stream using classic pattern matching. You may want to subclass the Matcher class of my solution streamsearch-py and perform regex matching in the buffer. Check out the included kmp_example.py below for a template. If it turns out classic Knuth-Morris-Pratt matching is all you need, then your problem would be solved right now with this little open source library :-)

#!/usr/bin/env python

# Copyright 2014-2015 @gitagon. For alternative licenses contact the author.
# 
# This file is part of streamsearch-py.
# streamsearch-py is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
# 
# streamsearch-py is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU Affero General Public License for more details.
# You should have received a copy of the GNU Affero General Public License
# along with streamsearch-py.  If not, see <http://www.gnu.org/licenses/>.


from streamsearch.matcher_kmp import MatcherKMP
from streamsearch.buffer_reader import BufferReader

class StringReader():
    """for providing an example read() from string required by BufferReader"""
    def __init__(self, string):
        self.s = string
        self.i = 0

    def read(self, buf, cnt):
        if self.i >= len(self.s): return -1
        r = self.s[self.i]
        buf[0] = r
        result = 1
        print "read @%s" % self.i, chr(r), "->", result
        self.i+=1
        return result

def main():

    w = bytearray("abbab")
    print "pattern of length %i:" % len(w), w
    s = bytearray("aabbaabbabababbbc")
    print "text:", s
    m = MatcherKMP(w)
    r = StringReader(s)
    b = BufferReader(r.read, 200)
    m.find(b)
    print "found:%s, pos=%s " % (m.found(), m.get_index())


if __name__ == '__main__':
    main()

The output is:

pattern of length 5: abbab
text: aabbaabbabababbbc
read @0 a -> 1
read @1 a -> 1
read @2 b -> 1
read @3 b -> 1
read @4 a -> 1
read @5 a -> 1
read @6 b -> 1
read @7 b -> 1
read @8 a -> 1
read @9 b -> 1
found:True, pos=5 
gitagon answered Nov 09 '22 19:11