Regular expression matching a multiline block of text

People also ask

What is multiline mode in regex?

The Multiline option, or the m inline option, enables the regular expression engine to handle an input string that consists of multiple lines. It changes the interpretation of the ^ and $ language elements so that they match the beginning and end of each line, instead of only the beginning and end of the input string.
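As a rough illustration in Python (the language the answers on this page use), the sketch below compares the default behaviour with the re.MULTILINE flag and the inline (?m) option. The sample string and variable names are made up for the example.

```python
import re

text = "alpha\nbeta\ngamma"

# Default mode: ^ and $ anchor only at the start and end of the whole string,
# so a pattern for "one whole line of word characters" finds nothing here.
default_matches = re.findall(r"^\w+$", text)

# With re.MULTILINE, ^ and $ also anchor at every line boundary.
flag_matches = re.findall(r"^\w+$", text, re.MULTILINE)

# The inline option (?m) at the start of the pattern has the same effect.
inline_matches = re.findall(r"(?m)^\w+$", text)

print(default_matches)  # []
print(flag_matches)     # ['alpha', 'beta', 'gamma']
print(inline_matches)   # ['alpha', 'beta', 'gamma']
```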

What is multiline flag in regex?

The "m" flag indicates that a multiline input string should be treated as multiple lines. For example, if "m" is used, "^" and "$" change from matching at only the start or end of the entire string to the start or end of any line within the string. (In JavaScript, the resulting multiline property of the regular expression object is read-only; you cannot change it directly.)

How do you match line breaks in regex?

By default, without multiline mode, the symbols "^" and "$" match the beginning and end of the whole input, not the beginning and end of each line. To match a line break explicitly, put the newline sequence itself in the pattern: "\n" for Unix-style line endings or "\r\n" for Windows-style ones.

What is multiline pattern match?

Pattern.MULTILINE or (?m) tells Java to let the anchors ^ and $ match at the start and end of each line (otherwise they only match at the start and end of the entire string).

Try this:

re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)

I think your biggest problem is that you're expecting the ^ and $ anchors to match linefeeds, but they don't. They match positions, not characters: in multiline mode, ^ matches at the start of the string and at the position immediately following each newline, and $ matches at the end of the string and at the position immediately preceding each newline.
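A small sketch of that point, using a made-up two-line string: the anchors produce empty matches at line boundaries, so a pattern that spans two lines has to consume the newline explicitly.

```python
import re

text = "first\nsecond"

# ^ and $ match empty strings at positions; they consume no characters.
line_starts = [m.start() for m in re.finditer(r"^", text, re.MULTILINE)]
line_ends = [m.start() for m in re.finditer(r"$", text, re.MULTILINE)]
print(line_starts)  # [0, 6]
print(line_ends)    # [5, 12]

# To cross from one line to the next, the pattern must consume the \n itself.
with_newline = re.search(r"^first$\n^second$", text, re.MULTILINE)
without_newline = re.search(r"^first$^second$", text, re.MULTILINE)
print(with_newline is not None)  # True
print(without_newline is None)   # True
```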

Be aware, too, that a newline can consist of a linefeed (\n), a carriage-return (\r), or a carriage-return+linefeed (\r\n). If you aren't certain that your target text uses only linefeeds, you should use this more inclusive version of the regex:

re.compile(r"^(.+)(?:\n|\r\n?)((?:(?:\n|\r\n?).+)+)", re.MULTILINE)

BTW, you don't want to use the DOTALL modifier here; you're relying on the fact that the dot matches everything except newlines.
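To see why DOTALL gets in the way, here is a quick comparison on an invented two-record sample. Without DOTALL the dot stops at each newline and the regex finds both records; with DOTALL the first group runs across line breaks and the records are merged into a single match.

```python
import re

# Hypothetical sample: two title/sequence records separated by blank lines.
text = "Title line\n\nAAAA\nBBBB\n\nNext title\n\nCCCC\n"
pattern = r"^(.+)\n((?:\n.+)+)"

without_dotall = re.findall(pattern, text, re.MULTILINE)
with_dotall = re.findall(pattern, text, re.MULTILINE | re.DOTALL)

print(without_dotall)
# [('Title line', '\nAAAA\nBBBB'), ('Next title', '\nCCCC')]

# With DOTALL there is only one match, and its "title" spans several lines.
print(len(with_dotall))
print(repr(with_dotall[0][0]))
```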

This will work:

>>> import re
>>> rx_sequence=re.compile(r"^(.+?)\n\n((?:[A-Z]+\n)+)",re.MULTILINE)
>>> rx_blanks=re.compile(r"\W+") # to remove blanks and newlines
>>> text="""Some varying text1
...
... AAABBBBBBCCCCCCDDDDDDD
... EEEEEEEFFFFFFFFGGGGGGG
... HHHHHHIIIIIJJJJJJJKKKK
...
... Some varying text 2
...
... LLLLLMMMMMMNNNNNNNOOOO
... PPPPPPPQQQQQQRRRRRRSSS
... TTTTTUUUUUVVVVVVWWWWWW
... """
>>> for match in rx_sequence.finditer(text):
...   title, sequence = match.groups()
...   title = title.strip()
...   sequence = rx_blanks.sub("",sequence)
...   print("Title:", title)
...   print("Sequence:", sequence)
...   print()
...
Title: Some varying text1
Sequence: AAABBBBBBCCCCCCDDDDDDDEEEEEEEFFFFFFFFGGGGGGGHHHHHHIIIIIJJJJJJJKKKK

Title: Some varying text 2
Sequence: LLLLLMMMMMMNNNNNNNOOOOPPPPPPPQQQQQQRRRRRRSSSTTTTTUUUUUVVVVVVWWWWWW

Some explanation about this regular expression might be useful: ^(.+?)\n\n((?:[A-Z]+\n)+)

The first character (^) means "starting at the beginning of a line". Be aware that it does not match the newline itself (same for $: it means "just before a newline", but it does not match the newline itself).
Then (.+?)\n\n means "match as few characters as possible (any character except a newline is allowed) until you reach two newlines". The result (without the newlines) is put in the first group.
[A-Z]+\n means "match as many upper case letters as possible until you reach a newline". This defines what I will call a textline.
((?:textline)+) means match one or more textlines but do not put each line in a group. Instead, put all the textlines in one group.
You could add a final \n in the regular expression if you want to enforce a double newline at the end.
Also, if you are not sure about what type of newline you will get (\n or \r or \r\n) then just fix the regular expression by replacing every occurrence of \n by (?:\n|\r\n?).
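The newline substitution described above might look like the sketch below. The names NL and parse and the sample data are invented for the example. One caveat worth checking against your data: in Python, ^ in multiline mode only treats \n as a line boundary, so text that uses bare \r line endings would probably need its line endings normalized first.

```python
import re

# One line break: LF, CR, or CRLF (hypothetical helper name).
NL = r"(?:\n|\r\n?)"

# Same regex as above with every \n replaced by the NL fragment.
rx_sequence = re.compile(rf"^(.+?){NL}{NL}((?:[A-Z]+{NL})+)", re.MULTILINE)

def parse(text):
    """Return (title, sequence) pairs with all whitespace removed from the sequence."""
    return [
        (m.group(1), re.sub(r"\s+", "", m.group(2)))
        for m in rx_sequence.finditer(text)
    ]

lf_text = "Title one\n\nAAAB\nCCCD\n\nTitle two\n\nEEEF\n"
crlf_text = lf_text.replace("\n", "\r\n")  # same data with Windows line endings

print(parse(lf_text))    # [('Title one', 'AAABCCCD'), ('Title two', 'EEEF')]
print(parse(crlf_text))  # [('Title one', 'AAABCCCD'), ('Title two', 'EEEF')]
```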

The following is a regular expression matching a multiline block of text:

import re
# text is the string to search
result = re.findall(r'(startText)(.+)((?:\n.+)+)(endText)', text)
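For illustration, here is that pattern applied to a made-up string. Note that it needs at least one line break between startText and endText and does not cross blank lines. The second pattern is a different, commonly used technique (DOTALL with a lazy quantifier), shown only as a possible alternative when just the body between the markers is wanted.

```python
import re

# Hypothetical input; startText and endText are the literal markers from the pattern.
text = "header\nstartText first line\nsecond line\nthird line endText\nfooter"

# Pattern from the answer: four groups per match.
blocks = re.findall(r"(startText)(.+)((?:\n.+)+)(endText)", text)
print(blocks)
# [('startText', ' first line', '\nsecond line\nthird line ', 'endText')]

# Alternative sketch: let the dot match newlines and stop at the first endText.
bodies = re.findall(r"startText(.*?)endText", text, re.DOTALL)
print(bodies)
# [' first line\nsecond line\nthird line ']
```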

If each file has only one sequence of amino acids, I wouldn't use regular expressions at all. Just something like this:

def read_amino_acid_sequence(path):
    with open(path) as sequence_file:
        title = sequence_file.readline() # read 1st line
        aminoacid_sequence = sequence_file.read() # read the rest

    # some cleanup, if necessary
    title = title.strip() # remove trailing white spaces and newline
    aminoacid_sequence = aminoacid_sequence.replace(" ","").replace("\n","")
    return title, aminoacid_sequence
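If a file holds several title/sequence records, the same no-regex idea can probably be stretched a little. The sketch below assumes records alternate strictly between a title paragraph and a sequence paragraph separated by blank lines; the function name parse_records and the sample text are invented for the example.

```python
def parse_records(text):
    """Split blank-line-separated text into (title, sequence) pairs.

    Assumes paragraphs alternate: title, sequence, title, sequence, ...
    """
    # Normalize CRLF and bare CR line endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Paragraphs are separated by blank lines; drop empty fragments.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    records = []
    for title, sequence in zip(paragraphs[0::2], paragraphs[1::2]):
        # "".join(sequence.split()) removes spaces and newlines inside the sequence.
        records.append((title.strip(), "".join(sequence.split())))
    return records

sample = "Some varying text1\n\nAAAB\nCCCD\n\nSome varying text 2\n\nEEEF\nGGGH\n"
print(parse_records(sample))
# [('Some varying text1', 'AAABCCCD'), ('Some varying text 2', 'EEEFGGGH')]
```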

Tags: python, regex, multiline
