
strsplit on all spaces and punctuation except apostrophes [duplicate]

Tags: regex, r

I have asked related questions HERE and HERE. I tried to generalize those answers but failed.

Basically, I have a string that I want to split into words, numbers, and any sort of punctuation, but I want to keep the apostrophes inside words. Here is what I've tried, and I'm close (I think):

x <- "Raptors don't like robots! I'd pay $500.00 to rid them."

strsplit(x, "(\\s+)|(?=[[:punct:]])", perl = TRUE)

## [[1]]
##  [1] "Raptors" "don"     "'"       "t"       "like"    "robots"  "!"
##  [8] ""        "I"       "'"       "d"       "pay"     "$"       "500"
## [15] "."       "00"      "to"      "rid"     "them"    "."

Here's what I'm after:

## [[1]]
##  [1] "Raptors" "don't"   "like"    "robots"  "!"       ""        "I'd"
##  [8] "pay"     "$"       "500"     "."       "00"      "to"      "rid"
## [15] "them"    "."

While I want a base solution, I would also like to see other solutions (I'm sure someone has a stringr solution), which would make the question more generalizable to others.

Note: R has a specific regex system. You will want to be familiar with R to answer this question.

Tyler Rinker asked Mar 06 '14 20:03


1 Answer

You could add a negative lookahead, (?!'), in front of the punctuation lookahead, so that the split never happens at an apostrophe:

strsplit(x, "(\\s+)|(?!')(?=[[:punct:]])", perl = TRUE)
#  [1] "Raptors" "don't"   "like"    "robots"  "!"       ""        "I'd"     "pay"     "$"       "500"     "."       "00"      "to"      "rid"     "them"    "."
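
If the empty string left behind by "! " is not wanted, another option is to extract the tokens instead of splitting on the gaps. This is only a sketch of a different technique, not part of the strsplit approach above: the name token_pattern and the pattern itself are my own assumptions. It treats any run of letters, digits, or apostrophes as one token and every other punctuation character as a token of its own.

```r
x <- "Raptors don't like robots! I'd pay $500.00 to rid them."

# Hypothetical token pattern (an assumption of this sketch):
#   [[:alnum:]']+  a run of letters, digits, or apostrophes (a "word")
#   [[:punct:]]    otherwise, a single punctuation character
token_pattern <- "[[:alnum:]']+|[[:punct:]]"

# gregexpr() finds every match; regmatches() pulls the matched text out.
tokens <- regmatches(x, gregexpr(token_pattern, x, perl = TRUE))[[1]]
tokens
```

This should give the same tokens as the strsplit call, minus the "" element, because whitespace is simply never matched. One difference to be aware of: a stray apostrophe that is not inside a word would come back as its own token.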
sgibb answered Oct 13 '22 20:10