I have text that looks like this:
bla bla bla
bla some on wanted text....
****************************************************************************
List of 12 base pairs
nt1 nt2 bp name Saenger LW DSSR
1 Q.C0 Q.G22 C-G WC 19-XIX cWW cW-W
2 Q.C1 Q.G21 C-G WC 19-XIX cWW cW-W
3 Q.U2 Q.A20 U-A WC 20-XX cWW cW-W
****************************************************************************
another unwanted text ...
another unwanted text
What I want to do is extract the section that starts with "List of xxx base pairs"
and ends at the first ***** line that follows it.
There are cases where this section does not appear at all. If that happens,
the output should be just "NONE".
How can I do that with Python?
I tried this but failed; it prints no output at all.
import sys
import re
import csv

def main():
    """docstring for main"""
    infile = "myfile.txt"
    if len(sys.argv) > 1:
        infile = sys.argv[1]

    regex = re.compile(r"""List of (\d+) base pairs$""",re.VERBOSE)
    with open(infile, 'r') as tsvfile:
        tabreader = csv.reader(tsvfile, delimiter='\t')
        for row in tabreader:
            if row:
                line = row[0]
                match = regex.match(line)
                if match:
                    print line

if __name__ == '__main__':
    main()
At the end of the code I was hoping it would just print this:
nt1 nt2 bp name Saenger LW DSSR
1 Q.C0 Q.G22 C-G WC 19-XIX cWW cW-W
2 Q.C1 Q.G21 C-G WC 19-XIX cWW cW-W
3 Q.U2 Q.A20 U-A WC 20-XX cWW cW-W
Or simply
NONE
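There are a couple of problems. The regex is too restrictive: with re.VERBOSE, the unescaped spaces in the pattern are ignored, so it never matches a line that contains spaces, and the trailing $ tightens it further. The loop prints only the matching line instead of treating the match as the starting point. And there isn't an early exit at the ******* endpoint.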
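To see why the original pattern never matches, here is a small sketch comparing the same pattern compiled with and without re.VERBOSE. The variable names (verbose_regex, plain_regex, and so on) are made up for illustration only.

```python
import re

line = "List of 12 base pairs"

# With re.VERBOSE, unescaped whitespace inside the pattern is ignored,
# so this behaves like the pattern "Listof(\d+)basepairs$".
verbose_regex = re.compile(r"""List of (\d+) base pairs$""", re.VERBOSE)

# Without the flag, the spaces in the pattern are matched literally.
plain_regex = re.compile(r"List of (\d+) base pairs$")

verbose_hit = verbose_regex.match(line)
plain_hit = plain_regex.match(line)

print(verbose_hit)         # None
print(plain_hit.group(1))  # 12
```

If you want to keep re.VERBOSE for readability, the spaces would need to be escaped or written as \s.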
Here's some working code to get you started:
import re

text = '''
bla bla bla
bla some on wanted text....
****************************************************************************
List of 12 base pairs
nt1 nt2 bp name Saenger LW DSSR
1 Q.C0 Q.G22 C-G WC 19-XIX cWW cW-W
2 Q.C1 Q.G21 C-G WC 19-XIX cWW cW-W
3 Q.U2 Q.A20 U-A WC 20-XX cWW cW-W
****************************************************************************
another unwanted text ...
another unwanted text
'''

regex = re.compile(r"List of (\d+) base pairs")

started = False
for line in text.splitlines():
    if started:
        # Stop at the first row of asterisks after the header
        if line.startswith('*******'):
            break
        print(line)
    elif regex.search(line):
        # Header found: start printing from the next line
        started = True
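The loop above does not yet cover the "NONE" case from the question. One possible way to handle it, as a rough sketch, is to collect the lines in a function and print "NONE" when nothing was collected. The function name extract_base_pairs and the shortened sample text are made up for this example.

```python
import re

HEADER = re.compile(r"List of (\d+) base pairs")


def extract_base_pairs(lines):
    """Return the lines between the header and the next row of asterisks.

    Returns an empty list when the header never appears.
    """
    section = []
    started = False
    for line in lines:
        line = line.rstrip("\n")
        if started:
            if line.startswith("*****"):
                break
            section.append(line)
        elif HEADER.search(line):
            started = True
    return section


sample = """bla bla bla
****************
List of 12 base pairs
nt1 nt2 bp name Saenger LW DSSR
1 Q.C0 Q.G22 C-G WC 19-XIX cWW cW-W
****************
another unwanted text
"""

result = extract_base_pairs(sample.splitlines())
print("\n".join(result) if result else "NONE")
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, for example inside `with open(infile) as fh:`. Note that in this sketch a header followed immediately by asterisks also prints "NONE".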