Basically the input file looks like this:
>U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete
cds. #some records don't have this line (see below)
Length = 2575
(some text)
>U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete
Length = 2575
(some text)
(etc...)
I wrote this to extract the line that starts with > and the number after Length:
import re
regex = re.compile("^(>.*)\r\n.*Length\s=\s(\d+)", re.MULTILINE)
match = regex.findall(sample_blast.read())
print match[0]
which works fine for extracting records when the Length line immediately follows the > line.
Then I tried re.DOTALL, which should make any record match (.*Length) regardless of whether there is an extra line or not:
regex = re.compile("^(>.*)\r\n.*(?:\r\n*.?)Length\s=\s(\d+)", re.MULTILINE|re.DOTALL)
But it does not work. I also tried re.MULTILINE and re.DOTALL on their own instead of combining them with the pipe, but it still does not work.
So the question is how to create a regex that matches the records and returns the desired groups whether or not the record has the extra line. It would be nice if someone could show this with re.VERBOSE as well. Sorry for the long post, and thanks in advance for any help. :)
Your problem is likely your use of \r\n. Instead, try using only \n:
>>> x = """
... >U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete
...
... cds. #some records don't have this line (see below)
...
... Length = 2575
... (some text)
...
... >U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete
...
... Length = 2575
... (some text)
...
... (etc...)
... """
>>> re.search("^(>.*)\n.*(?:\n*.?)Length\s=\s(\d+)", x, re.MULTILINE|re.DOTALL)
<_sre.SRE_Match object at 0x10c937e00>
>>> _.group(2)
'2575'
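Additionally, your first .* is too greedy: under re.DOTALL it also matches newlines, so it runs on to the last Length line in the input. Instead, try the non-greedy form ^(>.*?)$.*?Length\s=\s(\d+):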
>>> re.findall("^(>.*?)$.*?Length\s=\s(\d+)", x, re.MULTILINE|re.DOTALL)
[('>U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete', '2575'), ('>U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete', '2575')]
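Since the question also asked for an re.VERBOSE version, here is a minimal sketch of the same non-greedy idea, written so that it does not care whether lines end in \r\n or \n. The sample string, the second record (X00001) and the name record_re are made up for illustration and are not from the original file.

```python
import re

# Hypothetical sample modelled on the records in the question.
# The first record uses \r\n and has the extra "cds." line;
# the second uses \n and has no extra line.
sample = (
    ">U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete\r\n"
    "cds.\r\n"
    "Length = 2575\r\n"
    "(some text)\r\n"
    ">X00001 Hypothetical second record\n"
    "Length = 1234\n"
    "(some text)\n"
)

record_re = re.compile(
    r"""
    ^(>[^\r\n]*)    # group 1: the header line, from > up to the line ending
    .*?             # anything (newlines too, via DOTALL), as little as possible
    Length\s=\s     # the Length label as it appears in the file
    (\d+)           # group 2: the length value
    """,
    re.MULTILINE | re.DOTALL | re.VERBOSE,
)

matches = record_re.findall(sample)
for header, length in matches:
    print(header, length)
```

Using [^\r\n]* for the header keeps group 1 on a single line without hard-coding the line ending. One caveat: if a record had no Length line at all, the lazy .*? would carry on into the next record and borrow its Length.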