
Python regular expression to split paragraphs

How would one write a regular expression in Python to split text into paragraphs?

A paragraph break is defined by two line breaks (\n). Any number of spaces or tabs may appear alongside the line breaks, and it should still count as a paragraph break.

I am using Python, so the solution can use Python's extended regular expression syntax (e.g. (?P...) named groups).

Examples:

the_str = 'paragraph1\n\nparagraph2'
# splitting should yield ['paragraph1', 'paragraph2']

the_str = 'p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3'
# should yield ['p1', 'p2\t\n\tstill p2', 'p3']

the_str = 'p1\n\n\n\tp2'
# should yield ['p1', '\n\tp2']

The best I could come up with is: r'[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*', i.e.

import re
paragraphs = re.split(r'[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*', the_str)

but that is ugly. Anything better?
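
For anyone who wants to check a candidate pattern against the examples above, here is a minimal sketch. The names PARA_BREAK and split_paragraphs are made up for illustration; the pattern itself is the one from the question.

```python
import re

# The pattern from the question: two line breaks, each of which may be
# surrounded by blanks (whitespace other than \n).
PARA_BREAK = re.compile(r'[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*')


def split_paragraphs(text):
    """Split text wherever PARA_BREAK matches (hypothetical helper name)."""
    return PARA_BREAK.split(text)


examples = [
    'paragraph1\n\nparagraph2',
    'p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3',
    'p1\n\n\n\tp2',
]
for the_str in examples:
    print(split_paragraphs(the_str))
```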

EDIT:

Suggestions rejected:

r'\s*?\n\s*?\n\s*?' -> That would make examples 2 and 3 fail: \s includes \n, so it would allow paragraph breaks with more than two \n characters.

nosklo asked Sep 22 '08 18:09



2 Answers

Unfortunately there's no nice way to write "space but not a newline".

I think the best you can do is add some whitespace with the x (verbose) modifier and try to factor out the ugliness a bit, though whether that is an improvement is questionable: (?x) (?: [ \t\r\f\v]*? \n ){2} [ \t\r\f\v]*?
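
A sketch of what the verbose form might look like in Python. One caveat, as far as I can tell: with the lazy *? quantifiers as written, the trailing class matches as little as possible (that is, nothing), so blanks after the second line break stay attached to the next paragraph, and example 2 comes back with '\tp3' rather than 'p3'. Plain greedy * gives the expected output. The names LAZY_BREAK and GREEDY_BREAK are just for illustration.

```python
import re

# Verbose mode ignores whitespace in the pattern, except inside a character
# class, so the literal space in [ \t\r\f\v] still counts.
LAZY_BREAK = re.compile(r'(?x) (?: [ \t\r\f\v]*? \n ){2} [ \t\r\f\v]*?')

GREEDY_BREAK = re.compile(r'''
    (?: [ \t\r\f\v]* \n ){2}   # two line breaks, each optionally preceded by blanks
    [ \t\r\f\v]*               # blanks that follow the second line break
''', re.VERBOSE)

the_str = 'p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3'
print(LAZY_BREAK.split(the_str))    # last item keeps its leading tab
print(GREEDY_BREAK.split(the_str))  # last item is 'p3'
```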

You could also try creating a subrule just for the character class and interpolating it three times.
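
The subrule idea could be sketched roughly like this. BLANK, PARA_BREAK, ALT_BLANK and ALT_BREAK are hypothetical names, and the second variant uses the double-negative class [^\S\n] ("not non-whitespace and not newline"), which is a swapped-in idiom rather than something from the question.

```python
import re

# BLANK is a hypothetical name for the "whitespace but not newline" subrule.
BLANK = r'[ \t\r\f\v]'

# Interpolate the subrule three times into the overall pattern.
PARA_BREAK = re.compile(r'{b}*\n{b}*\n{b}*'.format(b=BLANK))

# Assumption: a shorter spelling of the subrule via a negated class.
# It matches any whitespace character except \n.
ALT_BLANK = r'[^\S\n]'
ALT_BREAK = re.compile(r'{b}*\n{b}*\n{b}*'.format(b=ALT_BLANK))

the_str = 'p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3'
print(PARA_BREAK.split(the_str))
print(ALT_BREAK.split(the_str))
```

Note that in Python 3, \s on str patterns is Unicode-aware, so [^\S\n] also matches characters such as a non-breaking space. The two subrules are therefore not strictly equivalent, which may or may not matter for your input.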

Eevee answered Sep 23 '22 04:09



Are you trying to deduce the structure of a document in plain text? Are you doing what docutils does?

You might be able to simply use the Docutils parser rather than roll your own.

S.Lott answered Sep 21 '22 04:09
