
Using re.MULTILINE and re.DOTALL together in Python

Tags: python, regex

Basically, the input file looks like this:

>U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete

       cds. #some records don't have this line (see below)

       Length = 2575

(some text)

>U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete

       Length = 2575

(some text)

(etc...)

I wrote this to extract the line that starts with > and the number on the Length line:

import re
regex = re.compile("^(>.*)\r\n.*Length\s=\s(\d+)", re.MULTILINE)
match = regex.findall(sample_blast.read())

print match[0]

This works fine for records where the Length line immediately follows the > line.

Then I tried re.DOTALL, which should let .*Length match in any record, whether or not there is an extra line:

regex = re.compile("^(>.*)\r\n.*(?:\r\n*.?)Length\s=\s(\d+)", re.MULTILINE|re.DOTALL)

But it does not work. I also tried re.MULTILINE and re.DOTALL instead of the pipe (|), but it still does not work.

So the question is: how do I write a regex that matches each record and returns the desired groups, whether or not the record has the extra line? It would be nice if someone could show this with re.VERBOSE as well. Sorry for the long post, and thanks in advance for any help. :)

asked Oct 28 '12 16:10 by noqa


1 Answer

Your problem is likely your use of \r\n. Instead, try using only \n:

>>> x = """
... >U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete
... 
...        cds. #some records don't have this line (see below)
... 
...        Length = 2575
... (some text)
... 
... >U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete
... 
...        Length = 2575
... (some text)
... 
... (etc...)
... """
>>> re.search("^(>.*)\n.*(?:\n*.?)Length\s=\s(\d+)", x, re.MULTILINE|re.DOTALL)
<_sre.SRE_Match object at 0x10c937e00>
>>> _.group(2)
'2575'
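
To see why the choice of line ending matters, here is a minimal sketch that runs both spellings of the simpler (non-DOTALL) pattern against the same record with each kind of line ending. The sample strings are shortened stand-ins for the real file, and whether read() actually returns \r\n depends on the platform and on how the file was opened, so treat this as an illustration rather than a diagnosis:

```python
import re

# The question's original pattern (expects \r\n) and the \n-only variant.
crlf_pattern = re.compile(r"^(>.*)\r\n.*Length\s=\s(\d+)", re.MULTILINE)
lf_pattern = re.compile(r"^(>.*)\n.*Length\s=\s(\d+)", re.MULTILINE)

# Hypothetical, shortened record; the real header line is longer.
lf_text = ">U51677 header\n       Length = 2575\n(some text)\n"
crlf_text = lf_text.replace("\n", "\r\n")

# The \r\n pattern finds nothing when the text only contains \n.
print(crlf_pattern.findall(lf_text))   # []

# The \n pattern matches both, but on \r\n text the header keeps a stray \r,
# because '.' matches '\r' (it only excludes '\n' without DOTALL).
print(lf_pattern.findall(lf_text))     # [('>U51677 header', '2575')]
print(lf_pattern.findall(crlf_text))   # [('>U51677 header\r', '2575')]

# One way to sidestep the issue: normalize line endings before matching.
normalized = crlf_text.replace("\r\n", "\n")
print(lf_pattern.findall(normalized))  # [('>U51677 header', '2575')]
```

Normalizing first keeps the pattern simple and avoids the trailing \r in the captured header.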

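Additionally, your first .* is too greedy. With re.DOTALL, . also matches newlines, so a greedy .* runs to the end of the input and backtracks only as far as the last Length line, merging every record into a single match. Use non-greedy quantifiers instead, ^(>.*?)$.*?Length\s=\s(\d+):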

>>> re.findall("^(>.*?)$.*?Length\s=\s(\d+)", x, re.MULTILINE|re.DOTALL)
[('>U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete', '2575'), ('>U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete', '2575')]
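
The question also asked for a re.VERBOSE version. The following is a minimal sketch of the same non-greedy pattern spread over several commented lines; the names record_re and sample are made up for illustration, and the sample assumes the text uses \n line endings:

```python
import re

record_re = re.compile(
    r"""
    ^(>.*?)$        # group 1: header line, from '>' to the end of that line
    .*?             # lazily skip anything, newlines included, up to the next
    Length\s=\s     # 'Length = ' (whitespace must be written as \s in VERBOSE mode)
    (\d+)           # group 2: the length value
    """,
    re.MULTILINE | re.DOTALL | re.VERBOSE,
)

# Two records: the first has the extra 'cds.' line, the second does not.
sample = """\
>U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete
       cds.
       Length = 2575
(some text)
>U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete
       Length = 2575
(some text)
"""

for header, length in record_re.findall(sample):
    print("%s %d" % (header, int(length)))
```

One caveat with this approach: because .*? may cross record boundaries, a record that has no Length line at all would be paired with the Length of the following record.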
answered Nov 15 '22 15:11 by David Wolever