python regular expression to split paragraphs

Question

How would one write a regular expression to use in Python to split paragraphs?

A paragraph break is defined by 2 line breaks (\n\n). But any number of spaces/tabs can appear together with the line breaks, and it should still be considered a paragraph break.

I am using Python, so the solution can use Python's extended regular expression syntax (it can make use of (?P...) stuff).

Examples:

the_str = 'paragraph1\n\nparagraph2'
# splitting should yield ['paragraph1', 'paragraph2']

the_str = 'p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3'
# should yield ['p1', 'p2\t\n\tstill p2', 'p3']

the_str = 'p1\n\n\n\tp2'
# should yield ['p1', '\n\tp2']

The best I could come up with is: r'[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*', i.e.

import re
paragraphs = re.split(r'[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*', the_str)

but that is ugly. Anything better?

EDIT:

Suggestions rejected:

r'\s*?\n\s*?\n\s*?' -> That would make examples 2 and 3 fail, since \s includes \n, so it would allow paragraph breaks with more than 2 \ns.
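
A quick check of the rejected suggestion, as a sketch rather than a definitive analysis. The greedy variant below is not from the question; it is added only because it shows the newline-swallowing most clearly. In this check the lazy form as quoted seems to trip on example 2 for a slightly different reason: its trailing `\s*?` matches nothing, so the tab before p3 survives.

```python
import re

EXAMPLE_2 = 'p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3'
EXAMPLE_3 = 'p1\n\n\n\tp2'

# The rejected suggestion, with lazy quantifiers.
lazy = re.compile(r'\s*?\n\s*?\n\s*?')
# Hypothetical greedy variant, added here only for comparison.
greedy = re.compile(r'\s*\n\s*\n\s*')

lazy_result = lazy.split(EXAMPLE_2)
greedy_result = greedy.split(EXAMPLE_3)

print(lazy_result)    # ['p1', 'p2\t\n\tstill p2', '\tp3']  (wanted: 'p3')
print(greedy_result)  # ['p1', 'p2']  (wanted: ['p1', '\n\tp2'])
```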

Eevee · Accepted Answer

Unfortunately there's no nice way to write "space but not a newline".

I think the best you can do is add some space with the x modifier and try to factor out the ugliness a bit, but that's questionable: (?x) (?: [ \t\r\f\v]*? \n ){2} [ \t\r\f\v]*?
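
A sketch of how the verbose form might be compiled and used. Two assumptions here are not in the answer: the pattern is written as a raw string (under re.VERBOSE a literal newline character in the pattern would be ignored, whereas the two-character escape \n is not), and the final character class is greedy rather than lazy, since a trailing lazy quantifier matches nothing and would leave blanks at the start of the next paragraph. `PARAGRAPH_BREAK` is a hypothetical name.

```python
import re

# Verbose-mode paragraph separator. Spaces inside the character classes are
# kept; whitespace elsewhere in the pattern is ignored by re.VERBOSE.
PARAGRAPH_BREAK = re.compile(r"""
    (?: [ \t\r\f\v]*? \n ){2}   # two line breaks, each optionally preceded by blanks
    [ \t\r\f\v]*                # blanks after the second line break (greedy: an assumption)
""", re.VERBOSE)

examples = [
    'paragraph1\n\nparagraph2',
    'p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3',
    'p1\n\n\n\tp2',
]

results = [PARAGRAPH_BREAK.split(s) for s in examples]
for paragraphs in results:
    print(paragraphs)
```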

You could also try creating a subrule just for the character class and interpolating it three times.
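
One way the subrule idea could look, as a minimal sketch. `BLANK`, `PARAGRAPH_BREAK` and `split_paragraphs` are hypothetical names, and the character class is the one from the question; the resulting pattern should be equivalent to the asker's, just with the repeated part named once.

```python
import re

# "Whitespace that is not a line break", zero or more times.
BLANK = r'[ \t\r\f\v]*'

# Interpolate the subrule three times around the two line breaks.
PARAGRAPH_BREAK = re.compile(rf'{BLANK}\n{BLANK}\n{BLANK}')


def split_paragraphs(text):
    """Split text on two line breaks, allowing blanks around them."""
    return PARAGRAPH_BREAK.split(text)


print(split_paragraphs('p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3'))
```

Keeping the class in one variable also makes it easier to adjust later, for example if other blank characters should count as part of a paragraph break.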

S.Lott · Answer

Are you trying to deduce the structure of a document in plain text? Are you doing what docutils does?

You might be able to simply use the Docutils parser rather than roll your own.

Tags: python, regex, text, split, parsing

nosklo
