Meaning of a bit of Perl's regular expression?

Tags: regex, perl
I'm translating code from Perl and I've come across the following line:

$text =~ s/([?!\.][\ ]*[\'\"\)\]\p{IsPf}]+) +([\'\"\(\[\¿\¡\p{IsPi}]*[\ ]*[\p{IsUpper}])/$1\n$2/g;

My question is: what do \p{IsPf} and \p{IsPi} match? I've tried searching online but haven't found anything.

Meh Nada asked Jun 04 '13 11:06


3 Answers

\p{..} matches characters by their Unicode character properties: http://perldoc.perl.org/perlunicode.html#Unicode-Character-Properties

In particular, \p{IsPf} matches characters with the "final punctuation" property (general category Pf), and \p{IsPi} matches characters with the "initial punctuation" property (general category Pi). These seem to be mostly closing and opening quotes, respectively.
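If the target language has no \p{...} support in its regex engine, the same property can usually be read from a Unicode database. A minimal sketch in Python, assuming Python as the translation target (the question does not say which language is used); `punctuation_kind` is a hypothetical helper name:

```python
import unicodedata

def punctuation_kind(ch: str) -> str:
    """Classify a character as initial (Pi) or final (Pf) punctuation, else 'other'."""
    category = unicodedata.category(ch)  # two-letter general category, e.g. 'Pi'
    if category == "Pi":
        return "initial"
    if category == "Pf":
        return "final"
    return "other"

# Curly and angle quotes fall into Pi/Pf; the plain ASCII quote does not.
for ch in ["\u201c", "\u201d", "\u00ab", "\u00bb", '"']:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {punctuation_kind(ch)}")
```

Note that the ASCII `"` and `'` are not in either category, which is presumably why the original regex lists them explicitly next to \p{IsPf} and \p{IsPi}.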

The point of the substitution seems to be breaking sentences onto separate lines: it matches the end of one sentence and the beginning of the next, taking into account that a sentence may start and end with various types of punctuation, and replaces the spaces between them with a newline.
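As a rough illustration of that reading, here is a sketch of the same substitution in Python using only the standard library. Python is an assumed target language, and the names `build_classes`, `SENTENCE_BREAK`, and `split_sentences` are hypothetical. Since the built-in `re` module does not understand \p{...}, the character classes are built from `unicodedata`; the category Lu is used as an approximation of Perl's \p{IsUpper}, which may cover a few more characters.

```python
import re
import sys
import unicodedata

def build_classes() -> dict:
    """Collect, in one pass, the characters of each needed general category."""
    wanted = {"Pf": [], "Pi": [], "Lu": []}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        bucket = wanted.get(unicodedata.category(ch))
        if bucket is not None:
            bucket.append(re.escape(ch))
    return {name: "".join(chars) for name, chars in wanted.items()}

CLASSES = build_classes()

# End of sentence: terminator, optional spaces, one or more closing quotes/brackets.
END_PART = r"[?!.] *['\"\)\]" + CLASSES["Pf"] + "]+"
# Start of sentence: optional opening quotes/brackets, optional spaces, an uppercase letter.
START_PART = r"['\"(\[¿¡" + CLASSES["Pi"] + "]* *[" + CLASSES["Lu"] + "]"

SENTENCE_BREAK = re.compile("(" + END_PART + ") +(" + START_PART + ")")

def split_sentences(text: str) -> str:
    """Replace the spaces between two sentences with a newline."""
    return SENTENCE_BREAK.sub(r"\1\n\2", text)

print(split_sentences('He said "Stop!" Then he left.'))
```

As in the Perl original, the pattern only fires when the terminator is followed by at least one closing quote or bracket, so a plain "Done. Next" is left alone. If third-party packages are an option, the `regex` module accepts \p{Pf}, \p{Pi}, and \p{Lu} directly, which avoids building the classes by hand.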

Joni answered Sep 20 '22 14:09


Let's ask RegexBuddy: It's a Unicode character property.

[Image: RegexBuddy screenshot]

You can find more documentation on Unicode character properties and Unicode scripts here.

Tim Pietzcker answered Sep 20 '22 14:09


As a bit of extra information, unichars from Unicode::Tussle can be used to list the matching characters.

$ unichars -au '\p{IsPi}' | cat
 «  U+000AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
 ‘  U+02018 LEFT SINGLE QUOTATION MARK
 ‛  U+0201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
 “  U+0201C LEFT DOUBLE QUOTATION MARK
 ‟  U+0201F DOUBLE HIGH-REVERSED-9 QUOTATION MARK
 ‹  U+02039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK
 ⸂  U+02E02 LEFT SUBSTITUTION BRACKET
 ⸄  U+02E04 LEFT DOTTED SUBSTITUTION BRACKET
 ⸉  U+02E09 LEFT TRANSPOSITION BRACKET
 ⸌  U+02E0C LEFT RAISED OMISSION BRACKET
 ⸜  U+02E1C LEFT LOW PARAPHRASE BRACKET
 ⸠  U+02E20 LEFT VERTICAL BAR WITH QUILL

$ unichars -au '\p{IsPf}' | cat
 »  U+000BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
 ’  U+02019 RIGHT SINGLE QUOTATION MARK
 ”  U+0201D RIGHT DOUBLE QUOTATION MARK
 ›  U+0203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
 ⸃  U+02E03 RIGHT SUBSTITUTION BRACKET
 ⸅  U+02E05 RIGHT DOTTED SUBSTITUTION BRACKET
 ⸊  U+02E0A RIGHT TRANSPOSITION BRACKET
 ⸍  U+02E0D RIGHT RAISED OMISSION BRACKET
 ⸝  U+02E1D RIGHT LOW PARAPHRASE BRACKET
 ⸡  U+02E21 RIGHT VERTICAL BAR WITH QUILL
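
If unichars is not available, a similar listing can be produced from a language's own Unicode tables. A minimal sketch in Python (an assumed language, not one mentioned in the question); `list_category` is a hypothetical helper, and the exact set of characters depends on the Unicode version bundled with the interpreter, so it may differ slightly from the lists above:

```python
import sys
import unicodedata
from typing import List

def list_category(category: str) -> List[str]:
    """Return one formatted line per code point in the given general category."""
    lines = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == category:
            name = unicodedata.name(ch, "<unnamed>")
            lines.append(f" {ch}  U+{cp:05X} {name}")
    return lines

# Pi = initial punctuation, Pf = final punctuation
for category in ("Pi", "Pf"):
    print(f"# {category}")
    print("\n".join(list_category(category)))
```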

ikegami answered Sep 17 '22 14:09