I have the following vector of strings. It contains two elements, each made up of two phrases collapsed together without a separator.
strings <- c("This is a phrase with a NameThis is another phrase",
             "This is a phrase with the number 2019This is another phrase")
I would like to split those phrases for each element in the vector. I've been trying something like:
library(stringr)
str_split(strings, "\\B(?=[a-z|0-9][A-Z])")
which almost gives me what I'm looking for:
[[1]]
[1] "This is a phrase with a Nam" "eThis is another phrase"
[[2]]
[1] "This is a phrase with the number 201" "9This is another phrase"
I would like to make the split AFTER the pattern but cannot figure out how to do that.
I guess I'm close to a solution and would appreciate any help.
You need to match the position right before the capital letter that starts the second phrase. Your pattern matches one character too early, at the position before the last character of the first phrase. You can simply match a non-word boundary followed by a capital letter, using a lookahead:
str_split(strings, "\\B(?=[A-Z])")
If the phrases can start with capital letters (for example, words in all caps) but contain no capital letters once the lowercase text begins, you can still split them by using a lookbehind for a lowercase letter or a digit. The non-word boundary is not needed this time:
strings <- c("SHOCKING NEWS: someone did somethingThis is another phrase",
             "This is a phrase with the number 2019This is another phrase")
str_split(strings, "(?<=[a-z0-9])(?=[A-Z])")
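To see why the lookbehind matters, here is a small sketch that runs both patterns on the all-caps example. It uses base R's strsplit with perl = TRUE instead of stringr so that it runs without extra packages; that swap is my own choice, and the variable names loose and strict are just illustrative.

```r
strings <- c("SHOCKING NEWS: someone did somethingThis is another phrase",
             "This is a phrase with the number 2019This is another phrase")

# Pattern 1: non-word boundary followed by a capital letter.
# It also matches inside all-caps words such as "SHOCKING".
loose <- strsplit(strings, "\\B(?=[A-Z])", perl = TRUE)

# Pattern 2: the split point must also follow a lowercase letter or a digit.
strict <- strsplit(strings, "(?<=[a-z0-9])(?=[A-Z])", perl = TRUE)

lengths(loose)   # the first element is broken into many pieces
lengths(strict)  # each element yields exactly two pieces
```

The first pattern is fine for the original data, but it over-splits as soon as a phrase contains consecutive capital letters.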
Alternative solution in base R: look for a lowercase letter or digit followed by an uppercase letter, and split between them.
strsplit(strings, "(?<=[[:lower:][:digit:]])(?=[[:upper:]])", perl=TRUE)
[[1]]
[1] "This is a phrase with a Name" "This is another phrase"
[[2]]
[1] "This is a phrase with the number 2019" "This is another phrase"
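If you need this in more than one place, the same pattern can be wrapped in a small helper. This is only a sketch: split_collapsed is a hypothetical name of my own, not a function from base R or any package, and it assumes the second phrase always starts with an uppercase letter directly after a lowercase letter or digit.

```r
# Hypothetical helper (name and signature are illustrative, not from any package).
split_collapsed <- function(x) {
  # Split between a lowercase letter or digit and a following uppercase letter.
  strsplit(x, "(?<=[[:lower:][:digit:]])(?=[[:upper:]])", perl = TRUE)
}

strings <- c("This is a phrase with a NameThis is another phrase",
             "This is a phrase with the number 2019This is another phrase")

parts <- split_collapsed(strings)  # list with one character vector per input element
flat  <- unlist(parts)             # flat character vector of all four phrases
flat
```

Keep the list form if you need to know which input each phrase came from; use unlist() only when a flat vector is enough.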