
Split a string AFTER a pattern occurs

Tags: regex, r, stringr

I have the following vector of strings. It contains two elements, each made up of two phrases run together with no separator.

strings <- c("This is a phrase with a NameThis is another phrase",
         "This is a phrase with the number 2019This is another phrase")

I would like to split each element of the vector into its two phrases. I've been trying something like:

library(stringr)

str_split(strings, "\\B(?=[a-z|0-9][A-Z])")

which almost gives me what I'm looking for:

[[1]]
[1] "This is a phrase with a Nam" "eThis is another phrase"

[[2]]
[1] "This is a phrase with the number 201" "9This is another phrase"

I would like to make the split AFTER the pattern but cannot figure out how to do that.

I guess I'm close to a solution and would appreciate any help.

allanvc asked Aug 22 '18 00:08


2 Answers

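You need to match the position right before the capital letter. Your lookahead `(?=[a-z|0-9][A-Z])` asserts a lowercase letter or digit followed by a capital, so the match lands before the last character of the first phrase, one character too early. (Inside a character class, `|` is also a literal pipe, not an alternation.) You can simply match a non-word boundary followed by a lookahead for a capital letter: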

str_split(strings, "\\B(?=[A-Z])")
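
As a quick check, here is a minimal sketch that applies this pattern to the vector from the question. It assumes stringr is installed; the variable name `parts` is only illustrative.

```r
library(stringr)

strings <- c("This is a phrase with a NameThis is another phrase",
             "This is a phrase with the number 2019This is another phrase")

# Before a capital letter, \\B requires the preceding character to be a word
# character (letter, digit or underscore), so the capital at the start of the
# string and the capital after a space ("a Name") are not split points.
parts <- str_split(strings, "\\B(?=[A-Z])")

parts[[1]]
# [1] "This is a phrase with a Name" "This is another phrase"
parts[[2]]
# [1] "This is a phrase with the number 2019" "This is another phrase"
```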

If the phrases can start with words written in capital letters, but contain no capital letters once the lowercase text has begun, the non-word boundary would split inside those capitalised words. In that case, split with a lookbehind for a lowercase letter or digit combined with the lookahead for a capital letter. No non-word boundary is needed this time:

strings <- c("SHOCKING NEWS: someone did somethingThis is another phrase",
         "This is a phrase with the number 2019This is another phrase")
str_split(strings, "(?<=[a-z0-9])(?=[A-Z])")
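
To see why the lookbehind matters, the following sketch compares both patterns on the first string above. It assumes stringr is installed; `x`, `over_split` and `clean_split` are illustrative names only.

```r
library(stringr)

x <- "SHOCKING NEWS: someone did somethingThis is another phrase"

# Non-word-boundary pattern: also splits inside "SHOCKING" and "NEWS",
# because each inner capital follows another word character.
over_split <- str_split(x, "\\B(?=[A-Z])")[[1]]
length(over_split) > 2
# [1] TRUE

# Lookbehind + lookahead: splits only where a lowercase letter or digit
# is immediately followed by a capital letter.
clean_split <- str_split(x, "(?<=[a-z0-9])(?=[A-Z])")[[1]]
clean_split
# [1] "SHOCKING NEWS: someone did something" "This is another phrase"
```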

CertainPerformance answered Sep 20 '22 23:09


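An alternative solution in base R: look for a lowercase letter or digit followed by an uppercase letter, and split in between. `perl = TRUE` is required because the default regex engine of `strsplit` does not support lookarounds.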

strsplit(strings, "(?<=[[:lower:][:digit:]])(?=[[:upper:]])", perl=TRUE)

[[1]]
[1] "This is a phrase with a Name" "This is another phrase"      

[[2]]
[1] "This is a phrase with the number 2019" "This is another phrase"
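
If a single flat vector of phrases is more convenient than a list, the same call can be wrapped in a small helper. This is only a sketch: `split_collapsed` is a hypothetical name, not a function from base R or stringr.

```r
strings <- c("This is a phrase with a NameThis is another phrase",
             "This is a phrase with the number 2019This is another phrase")

# Hypothetical helper: split every element between a lowercase letter or
# digit and a following uppercase letter, then flatten the result.
split_collapsed <- function(x) {
  pieces <- strsplit(x, "(?<=[[:lower:][:digit:]])(?=[[:upper:]])", perl = TRUE)
  unlist(pieces, use.names = FALSE)
}

# Returns the four phrases as one character vector.
phrases <- split_collapsed(strings)
phrases
```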

milan answered Sep 17 '22 23:09