How would one write a regular expression to use in python to split paragraphs?
A paragraph break is defined by 2 line breaks (\n). But there can be any number of spaces/tabs mixed in with the line breaks, and it should still count as a paragraph break.
I am using Python, so the solution can use Python's extended regular expression syntax (it can make use of (?P...) stuff).
the_str = 'paragraph1\n\nparagraph2'
# splitting should yield ['paragraph1', 'paragraph2']
the_str = 'p1\n\t\np2\t\n\tstill p2\t \n \n\tp3'
# should yield ['p1', 'p2\t\n\tstill p2', 'p3']
the_str = 'p1\n\n\n\tp2'
# should yield ['p1', '\n\tp2']
The best I could come up with is r'[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*', i.e.
import re
paragraphs = re.split(r'[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*', the_str)
but that is ugly. Anything better?
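EDIT:
r'\s*?\n\s*?\n\s*?' -> That would not work. \s includes \n, so a pattern built on it can accept paragraph breaks with more than 2 \n's (the greedy form r'\s*\n\s*\n\s*' turns example 3 into ['p1', 'p2']). And as written, the trailing lazy \s*? matches nothing, so example 2 ends with '\tp3' instead of 'p3'.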
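A quick check of how \s-based patterns behave on the examples above. The greedy variant is an assumption added for comparison; only the lazy one appears in the EDIT.

```python
import re

ex2 = 'p1\n\t\np2\t\n\tstill p2\t \n \n\tp3'
ex3 = 'p1\n\n\n\tp2'

lazy = r'\s*?\n\s*?\n\s*?'   # the pattern from the EDIT
greedy = r'\s*\n\s*\n\s*'    # assumed greedy variant, for comparison only

# The trailing lazy \s*? matches nothing, so the tab before p3 is left behind.
lazy_ex2 = re.split(lazy, ex2)
print(lazy_ex2)    # ['p1', 'p2\t\n\tstill p2', '\tp3']

# Greedy \s* also eats newlines, so all three \n collapse into one break.
greedy_ex3 = re.split(greedy, ex3)
print(greedy_ex3)  # ['p1', 'p2']
```

Neither result matches what the examples ask for, which is why the explicit [ \t\r\f\v] class is used instead of \s.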
Unfortunately there's no nice way to write "space but not a newline".
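I think the best you can do is add some space with the x modifier and try to factor out the ugliness a bit, but that's questionable: (?x) (?: [ \t\r\f\v]* \n ){2} [ \t\r\f\v]*
(The quantifiers need to be greedy: a lazy *? at the very end matches nothing and leaves trailing spaces/tabs at the start of the next paragraph.)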
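A minimal sketch of the verbose-mode idea, using re.VERBOSE rather than an inline (?x). The name PARA_BREAK is made up for illustration. It uses greedy quantifiers so that the blanks after the second newline are consumed; whitespace inside a character class is kept even in verbose mode, so the literal space in the class still counts.

```python
import re

# PARA_BREAK is a hypothetical name, not from the original post.
PARA_BREAK = re.compile(r"""
    (?: [ \t\r\f\v]* \n ){2}   # two newlines, each may be preceded by blanks
    [ \t\r\f\v]*               # blanks following the second newline
""", re.VERBOSE)

examples = [
    'paragraph1\n\nparagraph2',
    'p1\n\t\np2\t\n\tstill p2\t \n \n\tp3',
    'p1\n\n\n\tp2',
]
results = [PARA_BREAK.split(s) for s in examples]
for r in results:
    print(r)
```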
You could also try creating a subrule just for the character class and interpolating it three times.
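The subrule idea might look something like the sketch below. WS, WS_ALT and split_paragraphs are hypothetical names. WS_ALT is an assumption that is not in the post: the double-negative class [^\S\n] ("not non-whitespace and not a newline") is one arguably compact way to say "whitespace except newline", though it matches a slightly wider set of characters than the explicit list.

```python
import re

WS = r'[ \t\r\f\v]'   # the character class from the question, written once
WS_ALT = r'[^\S\n]'   # assumption: any whitespace character except \n

def split_paragraphs(text, ws=WS):
    """Split on two newlines, allowing blanks (but no extra \\n) around them."""
    # The subrule is interpolated three times, as suggested above.
    return re.split(rf'{ws}*\n{ws}*\n{ws}*', text)

print(split_paragraphs('p1\n\t\np2\t\n\tstill p2\t \n \n\tp3'))
# ['p1', 'p2\t\n\tstill p2', 'p3']
print(split_paragraphs('p1\n\n\n\tp2', ws=WS_ALT))
# ['p1', '\n\tp2']
```

Keeping the class in one variable means the definition of "blank" can be changed in a single place.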
Are you trying to deduce the structure of a document in plain text? Are you doing what docutils does?
You might be able to simply use the Docutils parser rather than roll your own.