Matching blank lines with regular expressions

Tags: python, regex

I've got a string that I'm trying to split into chunks based on blank lines.

Given a string s, I thought I could do this:

re.split('(?m)^\s*$', s)

This works in some cases:

>>> s = 'foo\nbar\n \nbaz'
>>> re.split('(?m)^\s*$', s)
['foo\nbar\n', '\nbaz']

But it doesn't work if the line is completely empty:

>>> s = 'foo\nbar\n\nbaz'
>>> re.split('(?m)^\s*$', s)
['foo\nbar\n\nbaz']

What am I doing wrong?

[python 2.5; no difference if I compile '^\s*$' with re.MULTILINE and use the compiled expression instead]

asked Jul 29 '09 01:07

John Fouhy

2 Answers

Try this instead:

re.split(r'\n\s*\n', s)

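The problem is that '^\s*$' only matches "spaces (if any) that are alone on a line", not the newlines themselves. When there's nothing on the line, the match is zero-length, so the delimiter is empty, and re.split in Python 2.5 never splits on an empty match. That is why the string comes back in one piece. (Since Python 3.7, re.split does split on empty matches, so the original pattern behaves differently there.)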
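The zero-length match can be seen directly by asking re.search for the span of the first match. This is a minimal sketch using the two strings from the question; the helper name first_blank_line_span is made up for illustration.

```python
import re

# The pattern from the question, written as a raw string.
BLANK_LINE = re.compile(r'(?m)^\s*$')

def first_blank_line_span(s):
    """Return the (start, end) span of the first match, or None (hypothetical helper)."""
    match = BLANK_LINE.search(s)
    return match.span() if match else None

# Line holding one space: the match covers that space, so it has length 1.
print(first_blank_line_span('foo\nbar\n \nbaz'))   # (8, 9)

# Completely empty line: the match starts and ends at the same index.
print(first_blank_line_span('foo\nbar\n\nbaz'))    # (8, 8)
```

In the second case start and end are equal, so there is no text for the delimiter to cover.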

This version also gets rid of the delimiting newlines themselves, which is probably what you want. Otherwise, you'll have the newlines stuck to the beginning and end of each split part.
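As a quick check, here is the newline-based pattern applied to both strings from the question. The wrapper name split_on_blank_lines is only for illustration.

```python
import re

def split_on_blank_lines(s):
    # Hypothetical wrapper: the delimiter is a newline, optional whitespace,
    # and another newline, so it always consumes at least two characters.
    return re.split(r'\n\s*\n', s)

# Blank line containing a space.
print(split_on_blank_lines('foo\nbar\n \nbaz'))  # ['foo\nbar', 'baz']

# Completely empty line.
print(split_on_blank_lines('foo\nbar\n\nbaz'))   # ['foo\nbar', 'baz']
```

Both inputs now split the same way, and neither part carries a leading or trailing newline from the delimiter.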

Treating multiple consecutive blank lines as defining an empty block ("abc\n\n\ndef" -> ["abc", "", "def"]) is trickier...
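One way to get that behaviour is to drop the regex and walk the lines instead, starting a new block at every blank line. This is only a sketch of one possible approach, not something from the answer above; the function name split_keeping_empty_blocks is hypothetical, and it assumes a line counts as blank when it contains nothing but whitespace.

```python
def split_keeping_empty_blocks(text):
    """Split text into blocks, treating every blank line as a separator.

    Two blank lines in a row produce an empty block between them.
    """
    blocks = [[]]
    for line in text.splitlines():
        if line.strip() == "":
            # Each blank line closes the current block and opens a new one.
            blocks.append([])
        else:
            blocks[-1].append(line)
    # Rejoin the lines of each block; an empty block becomes "".
    return ["\n".join(block) for block in blocks]

print(split_keeping_empty_blocks("abc\n\n\ndef"))  # ['abc', '', 'def']
```

Because it uses str.splitlines(), a single trailing newline at the end of the text does not create an extra empty block.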

answered Sep 28 '22 18:09

Glenn Maynard

The re library can split on one or more empty lines. An empty line is a string of zero or more whitespace characters that starts at the start of a line and ends at the end of a line. The special character '$' matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline (excerpt from the docs). Because '$' stops before the newline without consuming it, the pattern needs a trailing '\s*' to consume the line break itself. Everything is possible :-)

>>> import re
>>> text = "foo\n   \n    \n    \nbar\n"
>>> re.split(r"(?m)^\s*$\s*", text)
['foo\n', 'bar\n']

The same regex works with Windows-style line breaks.

>>> import re
>>> text = "foo\r\n       \r\n     \r\n   \r\nbar\r\n"
>>> re.split(r"(?m)^\s*$\s*", text)
['foo\r\n', 'bar\r\n']
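If the line break before the blank lines should be stripped from each chunk as well, a variation along these lines may help. It is a sketch, not part of the answer above: the name split_paragraphs is hypothetical, and it assumes the input uses either "\n" or "\r\n" line endings. The pattern requires two newline characters, so it can never match the empty string.

```python
import re

# Optional "\r", a newline, any run of whitespace, then one more newline.
PARAGRAPH_BREAK = re.compile(r'\r?\n\s*\n')

def split_paragraphs(text):
    """Split on one or more blank lines (hypothetical helper)."""
    return PARAGRAPH_BREAK.split(text)

# Unix-style line breaks.
print(split_paragraphs("foo\n   \n    \n    \nbar\n"))
# ['foo', 'bar\n']

# Windows-style line breaks.
print(split_paragraphs("foo\r\n       \r\n     \r\n   \r\nbar\r\n"))
# ['foo', 'bar\r\n']
```

Only the final line break of the text is left in place, since no blank line follows it.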

answered Sep 28 '22 16:09

Sascha Gottfried
