
Parsing a section out of plain text in Python

Tags: python, parsing

I have text that looks like this:

    bla bla bla 
    bla some on wanted text....

****************************************************************************
List of 12 base pairs
      nt1              nt2             bp  name         Saenger     LW  DSSR
   1 Q.C0             Q.G22            C-G WC           19-XIX     cWW  cW-W
   2 Q.C1             Q.G21            C-G WC           19-XIX     cWW  cW-W
   3 Q.U2             Q.A20            U-A WC           20-XX      cWW  cW-W

****************************************************************************
another unwanted text ...
another unwanted text 

What I want to do is extract the section that starts with "List of xxx base pairs" and ends at the first ***** line that follows it.

There are cases where this section does not appear at all. If that happens, it should output just "NONE".

How can I do that with Python?

I tried this, but it failed: it prints no output at all.

import sys
import re
import csv

def main():
    """docstring for main"""
    infile = "myfile.txt"
    if len(sys.argv) > 1:
        infile = sys.argv[1]

    regex = re.compile(r"""List of (\d+) base pairs$""",re.VERBOSE)

    with open(infile, 'r') as tsvfile:
        tabreader = csv.reader(tsvfile, delimiter='\t')

        for row in tabreader:
            if row:
                line = row[0]
                match = regex.match(line)
                if match:
                    print line



if __name__ == '__main__':
    main()

At the end of the code I was hoping it would just print this:

      nt1              nt2             bp  name         Saenger     LW  DSSR
   1 Q.C0             Q.G22            C-G WC           19-XIX     cWW  cW-W
   2 Q.C1             Q.G21            C-G WC           19-XIX     cWW  cW-W
   3 Q.U2             Q.A20            U-A WC           20-XX      cWW  cW-W

Or simply

NONE
pdubois asked Feb 23 '26 00:02

1 Answer

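There are a couple of problems. The regex is too restrictive: for one thing, with re.VERBOSE the unescaped spaces in the pattern are ignored, so they never match the spaces in the line. The loop doesn't treat the regex match as the starting point; it only prints the matching line itself. And there isn't an early exit at the ******* endpoint.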
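
To see why the original pattern never matches, here is a small sketch comparing it with a plain pattern. The variable names are only illustrative and are not part of the question's code.

```python
import re

line = "List of 12 base pairs"

# Under re.VERBOSE, unescaped whitespace in the pattern is ignored,
# so this pattern behaves like "Listof(\d+)basepairs$".
verbose_regex = re.compile(r"""List of (\d+) base pairs$""", re.VERBOSE)

# Without re.VERBOSE the spaces are matched literally.
plain_regex = re.compile(r"List of (\d+) base pairs")

verbose_hit = verbose_regex.match(line)                   # no match
squeezed_hit = verbose_regex.match("Listof12basepairs")   # matches
plain_hit = plain_regex.search(line)                      # matches

print(verbose_hit is None, squeezed_hit is not None, plain_hit.group(1))
```

Dropping re.VERBOSE, or escaping each space in the pattern, should be enough to make the header line match.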

Here's some working code to get you started:

import re

text = '''
    bla bla bla 
    bla some on wanted text....

****************************************************************************
List of 12 base pairs
      nt1              nt2             bp  name         Saenger     LW  DSSR
   1 Q.C0             Q.G22            C-G WC           19-XIX     cWW  cW-W
   2 Q.C1             Q.G21            C-G WC           19-XIX     cWW  cW-W
   3 Q.U2             Q.A20            U-A WC           20-XX      cWW  cW-W

****************************************************************************
another unwanted text ...
another unwanted text
'''

regex = re.compile(r"List of (\d+) base pairs")

started = False
for line in text.splitlines():
    if started:
        if line.startswith('*******'):
            break
        print(line)
    elif regex.search(line):
        started = True
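
The question also asks for "NONE" when the section is missing. One possible way to extend the loop above is to collect the lines in a function and return None if the header is never seen. This is only a sketch: extract_section and report are hypothetical helper names, and the sample inputs are made up for illustration.

```python
import re

HEADER = re.compile(r"List of (\d+) base pairs")

def extract_section(lines):
    """Return the lines after the header, up to the next '*****' line.

    Returns None if the header never appears. If no '*****' line
    follows the header, everything up to the end of input is returned.
    """
    section = []
    started = False
    for line in lines:
        line = line.rstrip("\n")
        if started:
            if line.startswith("*****"):
                break
            section.append(line)
        elif HEADER.search(line):
            started = True
    return section if started else None

def report(lines):
    """Return the section as text, or "NONE" if it is absent."""
    section = extract_section(lines)
    return "NONE" if section is None else "\n".join(section)

# Made-up sample inputs, one with the section and one without.
with_section = [
    "unwanted text\n",
    "*" * 20 + "\n",
    "List of 2 base pairs\n",
    "      nt1   nt2\n",
    "   1 Q.C0  Q.G22\n",
    "\n",
    "*" * 20 + "\n",
    "more unwanted text\n",
]
without_section = ["unwanted text\n", "more unwanted text\n"]

print(report(with_section))
print(report(without_section))
```

Because the function accepts any iterable of lines, an open file object can be passed straight in, for example report(f) inside a with open(infile) as f: block.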
Raymond Hettinger answered Feb 25 '26 13:02