Multiline option, or the m inline option, enables the regular expression engine to handle an input string that consists of multiple lines. It changes the interpretation of the ^ and $ language elements so that they match the beginning and end of a line, instead of the beginning and end of the input string.
The " m " flag indicates that a multiline input string should be treated as multiple lines. For example, if " m " is used, " ^ " and " $ " change from matching at only the start or end of the entire string to the start or end of any line within the string. You cannot change this property directly.
Line breaks In pattern matching, the symbols “^” and “$” match the beginning and end of the full file, not the beginning and end of a line. If you want to indicate a line break when you construct your RegEx, use the sequence “\r\n”.
Pattern. MULTILINE or (? m) tells Java to accept the anchors ^ and $ to match at the start and end of each line (otherwise they only match at the start/end of the entire string). Pattern.
Try this:
re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)
I think your biggest problem is that you're expecting the ^
and $
anchors to match linefeeds, but they don't. In multiline mode, ^
matches the position immediately following a newline and $
matches the position immediately preceding a newline.
Be aware, too, that a newline can consist of a linefeed (\n
), a carriage-return (\r
), or a carriage-return+linefeed (\r\n
). If you aren't certain that your target text uses only linefeeds, you should use this more inclusive version of the regex:
re.compile(r"^(.+)(?:\n|\r\n?)((?:(?:\n|\r\n?).+)+)", re.MULTILINE)
BTW, you don't want to use the DOTALL modifier here; you're relying on the fact that the dot matches everything except newlines.
This will work:
>>> import re
>>> rx_sequence=re.compile(r"^(.+?)\n\n((?:[A-Z]+\n)+)",re.MULTILINE)
>>> rx_blanks=re.compile(r"\W+") # to remove blanks and newlines
>>> text="""Some varying text1
...
... AAABBBBBBCCCCCCDDDDDDD
... EEEEEEEFFFFFFFFGGGGGGG
... HHHHHHIIIIIJJJJJJJKKKK
...
... Some varying text 2
...
... LLLLLMMMMMMNNNNNNNOOOO
... PPPPPPPQQQQQQRRRRRRSSS
... TTTTTUUUUUVVVVVVWWWWWW
... """
>>> for match in rx_sequence.finditer(text):
... title, sequence = match.groups()
... title = title.strip()
... sequence = rx_blanks.sub("",sequence)
... print "Title:",title
... print "Sequence:",sequence
... print
...
Title: Some varying text1
Sequence: AAABBBBBBCCCCCCDDDDDDDEEEEEEEFFFFFFFFGGGGGGGHHHHHHIIIIIJJJJJJJKKKK
Title: Some varying text 2
Sequence: LLLLLMMMMMMNNNNNNNOOOOPPPPPPPQQQQQQRRRRRRSSSTTTTTUUUUUVVVVVVWWWWWW
Some explanation about this regular expression might be useful: ^(.+?)\n\n((?:[A-Z]+\n)+)
^
) means "starting at the beginning of a line". Be aware that it does not match the newline itself (same for $: it means "just before a newline", but it does not match the newline itself).(.+?)\n\n
means "match as few characters as possible (all characters are allowed) until you reach two newlines". The result (without the newlines) is put in the first group.[A-Z]+\n
means "match as many upper case letters as possible until you reach a newline. This defines what I will call a textline.((?:
textline)+)
means match one or more textlines but do not put each line in a group. Instead, put all the textlines in one group.\n
in the regular expression if you want to enforce a double newline at the end.\n
or \r
or \r\n
) then just fix the regular expression by replacing every occurrence of \n
by (?:\n|\r\n?)
.The following is a regular expression matching a multiline block of text:
import re
result = re.findall('(startText)(.+)((?:\n.+)+)(endText)',input)
If each file only has one sequence of aminoacids, I wouldn't use regular expressions at all. Just something like this:
def read_amino_acid_sequence(path):
with open(path) as sequence_file:
title = sequence_file.readline() # read 1st line
aminoacid_sequence = sequence_file.read() # read the rest
# some cleanup, if necessary
title = title.strip() # remove trailing white spaces and newline
aminoacid_sequence = aminoacid_sequence.replace(" ","").replace("\n","")
return title, aminoacid_sequence
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With