Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Multi-line Matching in Python

Tags:

python

regex

I've read all of the articles I could find, even understood a few of them but as a Python newb I'm still a little lost and hoping for help :)

I'm working on a script to parse items of interest out of an application specific log file, each line begins with a time stamp which I can match and I can define two things to identify what I want to capture, some partial content and a string that will be the termination of what I want to extract.

My issue is multi-line, in most cases every log line is terminated with a newline but some entries contain SQL that may have new lines within it and therefore creates new lines in the log.

So, in a simple case I may have this:

[8/21/13 11:30:33:557 PDT] 00000488 SystemOut     O 21 Aug 2013 11:30:33:557 [WARN] [MXServerUI01] [CID-UIASYNC-17464] BMXAA6720W - USER = (ABCDEF) SPID = (2526) app (ITEM) object (ITEM) : select * from item  where ((status != 'OBSOLETE' and itemsetid = 'ITEMSET1') and (exists (select 1 from maximo.invvendor where (exists (select 1 from maximo.companies where (( contains(name,'  $AAAA  ') > 0 )) and (company=invvendor.manufacturer and orgid=invvendor.orgid))) and (itemnum = item.itemnum and itemsetid = item.itemsetid)))) and (itemtype in (select value from synonymdomain where domainid='ITEMTYPE' and maxvalue = 'ITEM')) order by itemnum asc  (execution took 2083 milliseconds)

This all appears as one line which I can match with this:

re.compile('\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2}).*(milliseconds)')

However in some cases there may be line breaks in the SQL, as such I want to still capture it (and potentially replace the line breaks with spaces). I am currently reading the file a line at a time which obviously isn't going to work so...

  1. Do I need to process the whole file in one go? They are typically 20mb in size. How do I read the entire file and iterate through it looking for single or multi-line blocks?
  2. How would I write a multi-line RegEx that would match either the whole thing on one line or of it is spread across multiple lines?

My overall goal is to parameterize this so I can use it for extracting log entries that match different patterns of the starting string (always the start of a line), the ending string (where I want to capture to) and a value that is between them as an identifier.

Thanks in advance for any help!

Chris.

import sys, getopt, os, re

sourceFolder = 'C:/MaxLogs'
logFileName = sourceFolder + "/Test.log"
lines = []
print "--- START ----"
lineStartsWith = re.compile('\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2})(\ )')
lineContains = re.compile('.*BMXAA6720W.*')
lineEndsWith = re.compile('(?:.*milliseconds.*)')

lines = []
with open(logFileName, 'r') as f:
    for line in f:
        if lineStartsWith.match(line) and lineContains.match(line):
            if lineEndsWith.match(line) :
                print 'Full Line Found'
                print line
                print "- Record Separator -"
            else:
                print 'Partial Line Found'
                print line
                print "- Record Separator -"

print "--- DONE ----"

Next step, for my partial line I'll continue reading until I find lineEndsWith and assemble the lines in to one block.

I'm no expert so suggestions are always welcome!

UPDATE - So I have it working, thanks to all the responses that helped direct things, I realize it isn't pretty and I need to clean up my if / elif mess and make it more efficient but IT's WORKING! Thanks for all the help.

import sys, getopt, os, re

sourceFolder = 'C:/MaxLogs'
logFileName = sourceFolder + "/Test.log"

print "--- START ----"

lineStartsWith = re.compile('\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2})(\ )')
lineContains = re.compile('.*BMXAA6720W.*')
lineEndsWith = re.compile('(?:.*milliseconds.*)')

lines = []

multiLine = False

with open(logFileName, 'r') as f:
    for line in f:
        if lineStartsWith.match(line) and lineContains.match(line) and lineEndsWith.match(line):
            lines.append(line.replace("\n", " "))
        elif lineStartsWith.match(line) and lineContains.match(line) and not multiLine:
            #Found the start of a multi-line entry
            multiLineString = line
            multiLine = True
        elif multiLine and not lineEndsWith.match(line):
            multiLineString = multiLineString + line
        elif multiLine and lineEndsWith.match(line):
            multiLineString = multiLineString + line
            multiLineString = multiLineString.replace("\n", " ")
            lines.append(multiLineString)
            multiLine = False

for line in lines:
    print line
like image 414
Chris Avatar asked Aug 28 '13 17:08

Chris


1 Answers

Do I need to process the whole file in one go? They are typically 20mb in size. How do I read the entire file and iterate through it looking for single or multi-line blocks?

There are two options here.

You could read the file block by block, making sure to attach any "leftover" bit at the end of each block to the start of the next one, and search each block. Of course you will have to figure out what counts as "leftover" by looking at what your data format is and what your regex can match, and in theory it's possible for multiple blocks to all count as leftover…

Or you could just mmap the file. An mmap acts like a bytes (or like a str in Python 2.x), and leaves it up to the OS to handle paging blocks in and out as necessary. Unless you're trying to deal with absolutely huge files (gigabytes in 32-bit, even more in 64-bit), this is trivial and efficient:

with open('bigfile', 'rb') as f:
    with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as m:
        for match in compiled_re.finditer(m):
            do_stuff(match)

In older versions of Python, mmap isn't a context manager, so you'll need to wrap contextlib.closing around it (or just use an explicit close if you prefer).


How would I write a multi-line RegEx that would match either the whole thing on one line or of it is spread across multiple lines?

You could use the DOTALL flag, which makes the . match newlines. You could instead use the MULTILINE flag and put appropriate $ and/or ^ characters in, but that makes simple cases a lot harder, and it's rarely necessary. Here's an example with DOTALL (using a simpler regexp to make it more obvious):

>>> s1 = """[8/21/13 11:30:33:557 PDT] 00000488 SystemOut     O 21 Aug 2013 11:30:33:557 [WARN] [MXServerUI01] [CID-UIASYNC-17464] BMXAA6720W - USER = (ABCDEF) SPID = (2526) app (ITEM) object (ITEM) : select * from item  where ((status != 'OBSOLETE' and itemsetid = 'ITEMSET1') and (exists (select 1 from maximo.invvendor where (exists (select 1 from maximo.companies where (( contains(name,'  $AAAA  ') > 0 )) and (company=invvendor.manufacturer and orgid=invvendor.orgid))) and (itemnum = item.itemnum and itemsetid = item.itemsetid)))) and (itemtype in (select value from synonymdomain where domainid='ITEMTYPE' and maxvalue = 'ITEM')) order by itemnum asc  (execution took 2083 milliseconds)"""
>>> s2 = """[8/21/13 11:30:33:557 PDT] 00000488 SystemOut     O 21 Aug 2013 11:30:33:557 [WARN] [MXServerUI01] [CID-UIASYNC-17464] BMXAA6720W - USER = (ABCDEF) SPID = (2526) app (ITEM) object (ITEM) : select * from item  where ((status != 'OBSOLETE' and itemsetid = 'ITEMSET1') and 
    (exists (select 1 from maximo.invvendor where (exists (select 1 from maximo.companies where (( contains(name,'  $AAAA  ') > 0 )) and (company=invvendor.manufacturer and orgid=invvendor.orgid))) and (itemnum = item.itemnum and itemsetid = item.itemsetid)))) and (itemtype in (select value from synonymdomain where domainid='ITEMTYPE' and maxvalue = 'ITEM')) order by itemnum asc  (execution took 2083 milliseconds)"""
>>> r = re.compile(r'\[(.*?)\].*?milliseconds\)', re.DOTALL)
>>> r.findall(s1)
['8/21/13 11:30:33:557 PDF']
>>> r.findall(s2)
['8/21/13 11:30:33:557 PDF']

As you can see the second .*? matched the newline just as easily as a space.

If you're just trying to treat a newline as whitespace, you don't need either; '\s' already catches newlines.

For example:

>>> s1 = 'abc def\nghi\n'
>>> s2 = 'abc\ndef\nghi\n'
>>> r = re.compile(r'abc\s+def')
>>> r.findall(s1)
['abc def']
>>> r.findall(s2)
['abc\ndef']
like image 110
abarnert Avatar answered Oct 04 '22 04:10

abarnert