(In R) How to split words by title case in a string like "WeLiveInCA" into "We Live In CA" while preserving abbreviations?

Question

(In R) How to split words by title case in a string like "WeLiveInCA" into "We Live In CA" without splitting abbreviations?

I know how to split the string at every uppercase letter, but that would also split initialisms and abbreviations such as CA, USSR, or even U.S.A., and I need to preserve those.

So I'm thinking of some kind of conditional logic: if a word in the string isn't an initialism, insert a space wherever a lowercase character is followed by an uppercase character.

The snippet below inserts a space before every capital letter, but that also breaks up initialisms: CA becomes C A, which I don't want.

s <- "WeLiveInCA"
trimws(gsub('([[:upper:]])', ' \\1', s))
# "We Live In C A"

or another example...

s <- c("IDon'tEatKittensFYI", "YouKnowYourABCs")
trimws(gsub('([[:upper:]])', ' \\1', s))
# "I Don't Eat Kittens F Y I" "You Know Your A B Cs"

The results I'd want would be:

"We Live In CA"
#
"I Don't Eat Kittens FYI" "You Know Your ABCs"

But this needs to be widely applicable, not just for my examples.

Rui Barradas · Accepted Answer

Try with base R gregexpr/regmatches.

s <- c("WeLiveInCA", "IDon'tEatKittensFYI", "YouKnowYourABCs")
regmatches(s, gregexpr('[[:upper:]]+[^[:upper:]]*', s))
#[[1]]
#[1] "We"   "Live" "In"   "CA"  
#
#[[2]]
#[1] "IDon't"  "Eat"     "Kittens" "FYI"    
#
#[[3]]
#[1] "You"  "Know" "Your" "ABCs"
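
The call above returns a list of pieces, while the question asks for single strings. A minimal sketch of one way to glue the pieces back together, assuming a space is the wanted separator (the names `pieces` and `joined` are only illustrative):

```r
# Same input and pattern as in the answer above.
s <- c("WeLiveInCA", "IDon'tEatKittensFYI", "YouKnowYourABCs")
pieces <- regmatches(s, gregexpr("[[:upper:]]+[^[:upper:]]*", s))

# Collapse each character vector of pieces into one space-separated string.
joined <- vapply(pieces, paste, character(1), collapse = " ")
joined
# [1] "We Live In CA"          "IDon't Eat Kittens FYI" "You Know Your ABCs"
```

Note that any text before the first upper case letter is not matched by the pattern, so a string starting in lower case would lose its leading part with this approach.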

Explanation.

[[:upper:]]+ matches one or more upper case letters;
[^[:upper:]]* matches zero or more occurrences of anything but upper case letters.
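In sequence, these two regular expressions match a run of one or more upper case letters together with everything up to the next upper case letter. An initialism such as CA therefore stays in one piece, and for the same reason "IDon't" is returned as a single piece.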
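
As a possible alternative, the rule described in the question (insert a space where a lowercase character is followed by an uppercase character) can be sketched with a single gsub call. This is not part of the answer above; `split_title_case` is a hypothetical helper name, and the last input string is an extra example added here to show a dotted abbreviation.

```r
# Hypothetical helper: put a space between a lower case letter and the
# upper case letter that immediately follows it. Runs of upper case
# letters (CA, FYI, U.S.A.) contain no such boundary and are left alone.
split_title_case <- function(x) {
  gsub("([[:lower:]])([[:upper:]])", "\\1 \\2", x)
}

s <- c("WeLiveInCA", "IDon'tEatKittensFYI", "YouKnowYourABCs",
       "WeLiveInTheU.S.A.")
split_title_case(s)
# [1] "We Live In CA"          "IDon't Eat Kittens FYI" "You Know Your ABCs"
# [4] "We Live In The U.S.A."
```

Like the gregexpr approach, this sketch leaves "IDon't" unsplit. A rule that separates "I" from "Don't" would also turn "ABCs" into "AB Cs", so telling those two cases apart probably needs a word list rather than a regular expression alone.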

Tags: string, regex, split, r

Samantha Karlaina Rhoads
