Use of re.MULTILINE and re.DOTALL together python

Question

Basically, the input file looks like this:

>U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete
       cds. #some records don't have this line (see below)

       Length = 2575
(some text)

>U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete
       Length = 2575
(some text)

(etc...)

I wrote this to extract the line that starts with > and the number after Length:

import re
regex = re.compile("^(>.*)\r\n.*Length\s=\s(\d+)", re.MULTILINE)
match = regex.findall(sample_blast.read())

print match[0]

This works fine for extracting records when the Length line immediately follows the > line.

Then I tried re.DOTALL, which should let .*Length match in any record, regardless of whether there is an extra line.

regex = re.compile("^(>.*)\r\n.*(?:\r\n*.?)Length\s=\s(\d+)", re.MULTILINE|re.DOTALL)

But it does not work. I also tried combining re.MULTILINE and re.DOTALL with "and" instead of the pipe, but that does not work either.

So the question is how to create a regex that matches the records and returns the desired groups whether or not the record has the extra line. It would be nice if someone could show this with re.VERBOSE as well. Sorry for the long post, and thanks in advance for any help. :)

David Wolever · Accepted Answer

Your problem is likely your use of \r\n. Instead, try using only \n:

>>> x = """
... >U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete
... 
...        cds. #some records don't have this line (see below)
... 
...        Length = 2575
... (some text)
... 
... >U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete
... 
...        Length = 2575
... (some text)
... 
... (etc...)
... """
>>> re.search("^(>.*)\n.*(?:\n*.?)Length\s=\s(\d+)", x, re.MULTILINE|re.DOTALL)
<_sre.SRE_Match object at 0x10c937e00>
>>> _.group(2)
'2575'
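
One plausible reason a pattern containing \r\n finds nothing: when a file is read in text mode, Python typically hands back lines ending in a bare \n, so the string never contains \r. A minimal sketch of that effect, where `sample` is a made-up stand-in for the file contents and the shortened header is only for illustration:

```python
import re

# Hypothetical stand-in for the text returned by sample_blast.read();
# it uses bare "\n" line endings, as text-mode reads usually produce.
sample = ">U51677 Human HMG1 gene, complete\n       Length = 2575\n(some text)\n"

# Same pattern twice, differing only in the line ending it expects.
crlf_pattern = re.compile(r"^(>.*)\r\n.*Length\s=\s(\d+)", re.MULTILINE)
lf_pattern = re.compile(r"^(>.*)\n.*Length\s=\s(\d+)", re.MULTILINE)

crlf_matches = crlf_pattern.findall(sample)  # the text has no "\r", so nothing matches
lf_matches = lf_pattern.findall(sample)      # matches the header and the length

print(crlf_matches)
print(lf_matches)
```

If the data really does contain \r\n (for example, a file opened in binary mode), a pattern such as \r?\n should accept both kinds of line ending.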

Additionally, your first .* is too greedy: with re.DOTALL it can run past the end of the header line and swallow later records. Instead, try using: ^(>.*?)$.*?Length\s=\s(\d+):

>>> re.findall("^(>.*?)$.*?Length\s=\s(\d+)", x, re.MULTILINE|re.DOTALL)
[('>U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete', '2575'), ('>U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete', '2575')]
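
Since the question also asked about re.VERBOSE, here is a rough sketch of the same lazy pattern written that way. The names `text` and `record_re` are made up for this example, and the sample is just the input from the question.

```python
import re

# Sample input copied from the question; "text" is a hypothetical name.
text = """
>U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete
       cds. #some records don't have this line (see below)

       Length = 2575
(some text)

>U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete
       Length = 2575
(some text)
"""

record_re = re.compile(
    r"""
    ^(>.*?)$        # header: starts with ">" and stops at the end of that line
    .*?             # lazily skip anything, newlines included (needs DOTALL)
    Length\s=\s     # VERBOSE ignores plain spaces, so whitespace is written as \s
    (\d+)           # the length value
    """,
    re.MULTILINE | re.DOTALL | re.VERBOSE,
)

records = record_re.findall(text)
for header, length in records:
    print("%s -> %s" % (header, length))
```

One caveat with this approach: if a record has no Length line at all, the lazy .*? will keep going and pick up the Length from the next record, so the pattern assumes every record has one.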
