R string removes punctuation on split

Tags: regex, r

Say I have a string such as the following:

x <- 'The world is at end. What do you think?   I am going crazy!    These people are too calm.'

I need to split only on the punctuation marks !, ? and . together with any whitespace that follows, keeping each punctuation mark attached to its sentence.

My attempt below removes the punctuation and leaves leading spaces in the split parts, though:

vec <- strsplit(x, '[!?.][:space:]*')

How can I split the string into sentences while keeping the punctuation?

paulie.jvenuez asked Nov 01 '13 03:11


1 Answer

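You can switch on PCRE by setting perl=TRUE and split with a lookbehind assertion. The lookbehind only checks the preceding character without consuming it, so the punctuation stays in the result.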

strsplit(x, '(?<![^!?.])\\s+', perl=TRUE)

Regular expression:

(?<!          look behind to see if there is not:
 [^!?.]       any character except: '!', '?', '.'
)             end of look-behind
\s+           whitespace (\n, \r, \t, \f, and " ") (1 or more times)
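
As a quick check, here is a minimal sketch that applies the pattern above to the string from the question. The names parts and parts_alt are only illustrative, and the second pattern is an assumed alternative (a positive lookbehind), not part of the original answer; for this input both should give the same four sentences.

```r
x <- 'The world is at end. What do you think?   I am going crazy!    These people are too calm.'

# Pattern from the answer: split on runs of whitespace that are NOT
# preceded by a character other than '!', '?' or '.'.
# strsplit() returns a list with one element per input string, hence [[1]].
parts <- strsplit(x, '(?<![^!?.])\\s+', perl = TRUE)[[1]]
print(parts)

# Assumed alternative (not from the answer): a positive lookbehind that
# requires one of '!', '?' or '.' directly before the whitespace.
parts_alt <- strsplit(x, '(?<=[!?.])\\s+', perl = TRUE)[[1]]
print(identical(parts, parts_alt))
```

Because \\s+ consumes the whole run of whitespace, the pieces carry no leading spaces, and because the lookbehind is zero-width, each piece keeps its closing punctuation.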


hwnd answered Sep 29 '22 07:09