(In R) How to split words by title case in a string like "WeLiveInCA" into "We Live In CA" without splitting abbreviations?
I know how to split the string at every uppercase letter, but doing that would split initialisms/abbreviations, like CA or USSR or even U.S.A. and I need to preserve those.
So I'm thinking of some kind of logic along these lines: if a word in the string isn't an initialism, insert a space wherever a lowercase character is followed by an uppercase character.
My snippet below inserts a space before every capital letter, but it undesirably breaks initialisms, so CA becomes C A.
s <- "WeLiveInCA"
trimws(gsub('([[:upper:]])', ' \\1', s))
# "We Live In C A"
Or another example:
s <- c("IDon'tEatKittensFYI", "YouKnowYourABCs")
trimws(gsub('([[:upper:]])', ' \\1', s))
# "I Don't Eat Kittens F Y I" "You Know Your A B Cs"
The results I'd want would be:
# "We Live In CA"
# "I Don't Eat Kittens FYI" "You Know Your ABCs"
But this needs to be widely applicable (not just for my examples).
Try base R's gregexpr/regmatches; the result is a list with one character vector of pieces per input string.
s <- c("WeLiveInCA", "IDon'tEatKittensFYI", "YouKnowYourABCs")
regmatches(s, gregexpr('[[:upper:]]+[^[:upper:]]*', s))
#[[1]]
#[1] "We" "Live" "In" "CA"
#
#[[2]]
#[1] "IDon't" "Eat" "Kittens" "FYI"
#
#[[3]]
#[1] "You" "Know" "Your" "ABCs"
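The question asks for one string per input rather than a list of pieces. A minimal sketch of one way to get there, assuming the regmatches() result above, is to paste each vector back together with spaces (the names pieces and joined are only illustrative):

```r
s <- c("WeLiveInCA", "IDon'tEatKittensFYI", "YouKnowYourABCs")

# Split each string into runs that start with one or more uppercase letters
pieces <- regmatches(s, gregexpr("[[:upper:]]+[^[:upper:]]*", s))

# Collapse each character vector into a single space-separated string
joined <- vapply(pieces, paste, character(1), collapse = " ")
joined
# "We Live In CA" "IDon't Eat Kittens FYI" "You Know Your ABCs"
```

Note that "IDon't" stays in one piece, exactly as in the list output above.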
Explanation.
[[:upper:]]+ matches one or more uppercase letters; [^[:upper:]]* matches zero or more occurrences of anything but uppercase letters.
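As an alternative sketch (not part of the regmatches() approach above), the idea from the question, inserting a space wherever a lowercase character is followed by an uppercase one, can probably be written as a single gsub() call with two capture groups. It returns plain strings and, unlike regmatches(), keeps any leading lowercase text. The helper name split_title_case is hypothetical.

```r
# Hypothetical helper: insert a space between a lowercase letter
# and the uppercase letter that immediately follows it
split_title_case <- function(x) {
  gsub("([[:lower:]])([[:upper:]])", "\\1 \\2", x)
}

s <- c("WeLiveInCA", "IDon'tEatKittensFYI", "YouKnowYourABCs")
split_title_case(s)
# "We Live In CA" "IDon't Eat Kittens FYI" "You Know Your ABCs"

# Leading lowercase text is preserved rather than dropped
split_title_case("weLiveInCA")
# "we Live In CA"
```

This still leaves "IDon't" unsplit. Also splitting before a capital that is followed by a lowercase letter would separate "I" from "Don't", but it would turn "ABCs" into "AB Cs" as well, so letter case alone cannot satisfy both examples.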