Suppose you want to run a regular expression search-and-extract over a pipe, but the pattern may span multiple lines. How do you do it? Is there a regular expression library that works on a stream?
I would like to do this with a Python library, but any solution is OK, as long as it is a library and not a command-line tool.
BTW, I know how to solve my current problem; I am just looking for a general solution.
If no such library exists, why can't a regular expression library work with a stream, given that the regular expression matching algorithm never needs to scan backward?
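If you are after a general solution, your algorithm would need to look something like:
1. Read a chunk of the stream into a buffer.
2. Search for the regexp in the buffer.
3. If the pattern matches, do whatever you want with the match, discard the start of the buffer up to match.end() and go to step 2.
4. If the pattern does not match, extend the buffer with more data from the stream and go to step 2.
This could end up using a lot of memory if no matches are found, but it is difficult to do better in the general case (consider trying to match .*x as a multi-line regexp in a large file where the only x is the last character).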
If you know more about the regexp, you might have other cases where you can discard part of the buffer.
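The buffer-and-search loop from this answer can be sketched roughly as below. This is only an illustration, not a tested library: the function name iter_stream_matches and its parameters are made up for the example. Two assumptions go beyond the answer's description: a match that touches the end of the buffer is held back until more data arrives (it might still grow), and patterns that can match the empty string are rejected. The hold-back rule is a heuristic; for some patterns (for example an alternation like abcd|ab) more input could still change a match that was already reported.

```python
import io
import re


def iter_stream_matches(stream, pattern, chunk_size=4096, flags=0):
    """Yield the text of each match of `pattern` in a text stream.

    Hypothetical helper for illustration; `stream` only needs a
    read(n) method that returns "" at end of input.
    """
    regex = re.compile(pattern, flags)
    buf = ""
    while True:
        # Extend the buffer with the next chunk ("" means end of stream).
        chunk = stream.read(chunk_size)
        at_eof = not chunk
        buf += chunk

        # Report every match that is safely inside the buffer.
        pos = 0
        while True:
            m = regex.search(buf, pos)
            if m is None:
                break
            if m.end() == len(buf) and not at_eof:
                # Assumption: a match ending at the buffer edge may still
                # grow, so wait for more data before reporting it.
                break
            if m.end() == m.start():
                raise ValueError("pattern must not match the empty string")
            yield m.group(0)
            pos = m.end()

        if at_eof:
            return
        # Discard everything up to the end of the last reported match.
        buf = buf[pos:]


if __name__ == "__main__":
    demo = io.StringIO("foo\nBEGIN a\nb END bar BEGIN c END")
    for text in iter_stream_matches(demo, r"BEGIN.*?END", chunk_size=5, flags=re.DOTALL):
        print(repr(text))
```

For a real pipe you could pass sys.stdin instead of the StringIO object. As noted above, the buffer keeps growing for as long as nothing matches, so a pattern like .*x can still end up holding the whole input in memory.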
I solved a similar problem for searching a stream using classic pattern matching. You may want to subclass the Matcher class of my solution, streamsearch-py, and perform regex matching in the buffer. Check out the included kmp_example.py below for a template. If it turns out classic Knuth-Morris-Pratt matching is all you need, then your problem would be solved right now with this little open source library :-)
#!/usr/bin/env python
# Copyright 2014-2015 @gitagon. For alternative licenses contact the author.
#
# This file is part of streamsearch-py.
# streamsearch-py is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# streamsearch-py is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
# You should have received a copy of the GNU Affero General Public License
# along with streamsearch-py. If not, see <http://www.gnu.org/licenses/>.
from streamsearch.matcher_kmp import MatcherKMP
from streamsearch.buffer_reader import BufferReader

class StringReader():
    """for providing an example read() from string required by BufferReader"""

    def __init__(self, string):
        self.s = string
        self.i = 0

    def read(self, buf, cnt):
        if self.i >= len(self.s): return -1
        r = self.s[self.i]
        buf[0] = r
        result = 1
        print "read @%s" % self.i, chr(r), "->", result
        self.i+=1
        return result

def main():
    w = bytearray("abbab")
    print "pattern of length %i:" % len(w), w
    s = bytearray("aabbaabbabababbbc")
    print "text:", s
    m = MatcherKMP(w)
    r = StringReader(s)
    b = BufferReader(r.read, 200)
    m.find(b)
    print "found:%s, pos=%s " % (m.found(), m.get_index())

if __name__ == '__main__':
    main()
Output:
pattern of length 5: abbab
text: aabbaabbabababbbc
read @0 a -> 1
read @1 a -> 1
read @2 b -> 1
read @3 b -> 1
read @4 a -> 1
read @5 a -> 1
read @6 b -> 1
read @7 b -> 1
read @8 a -> 1
read @9 b -> 1
found:True, pos=5
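If you just want to see why Knuth-Morris-Pratt fits a stream without pulling in the library, here is a minimal standalone sketch. It is not the streamsearch-py implementation; the function names build_failure_table and kmp_find_in_stream are made up for this example. The point is that KMP only ever looks at the current symbol plus a small table built from the pattern, so it never has to re-read earlier input.

```python
def build_failure_table(pattern):
    """fail[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find_in_stream(symbols, pattern):
    """Return the start index of the first occurrence of `pattern`, or -1.

    `symbols` is any iterable that yields one symbol at a time; it is
    consumed only up to the end of the first match.
    """
    if not pattern:
        return 0
    fail = build_failure_table(pattern)
    k = 0  # number of pattern symbols matched so far
    for index, symbol in enumerate(symbols):
        # On a mismatch, fall back in the pattern instead of in the input.
        while k > 0 and symbol != pattern[k]:
            k = fail[k - 1]
        if symbol == pattern[k]:
            k += 1
        if k == len(pattern):
            return index - len(pattern) + 1
    return -1


if __name__ == "__main__":
    print(kmp_find_in_stream(iter("aabbaabbabababbbc"), "abbab"))
```

With the same pattern and text as in kmp_example.py this returns 5 and stops reading after the symbol at index 9, which lines up with the read trace above.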