
How to convert a vector of strings to Title Case

Tags: regex, replace, r

I have a vector of strings in lower case. I'd like to change them to title case, meaning the first letter of every word would be capitalized. I've managed to do it with a double loop, but I'm hoping there's a more efficient and elegant way to do it, perhaps a one-liner with gsub and a regex.

Here's some sample data, along with the double loop that works, followed by other things I tried that didn't work.

strings = c("first phrase", "another phrase to convert",
            "and here's another one", "last-one")

# For each string in the strings vector, find the starting position of
#  each word (a run of lower-case letters beginning at a word boundary)
matches = gregexpr("\\b[a-z]+", strings) 

# For each string in the strings vector, convert the first letter 
#  of each word to upper case
for (i in 1:length(strings)) {

  # Extract the position of each regex match for the string in row i
  #  of the strings vector.
  match.positions = as.vector(matches[[i]])

  # Convert the letter in each match position to upper case
  for (j in 1:length(match.positions)) {

    substr(strings[i], match.positions[j], match.positions[j]) = 
      toupper(substr(strings[i], match.positions[j], match.positions[j]))
  }
}

This worked, but it seems inordinately complicated. I resorted to it only after experimenting unsuccessfully with more straightforward approaches. Here are some of the things I tried, along with the output:

# Google search suggested \\U might work, but evidently not in R
gsub("(\\b[a-z]+)", "\\U\\1" ,strings)
[1] "Ufirst Uphrase"                "Uanother Uphrase Uto Uconvert"
[3] "Uand Uhere'Us Uanother Uone"   "Ulast-Uone"                   

# I tried this on a lark, but to no avail
gsub("(\\b[a-z]+)", toupper("\\1"), strings)
[1] "first phrase"              "another phrase to convert"
[3] "and here's another one"    "last-one"  

The regex captures the correct positions in each string as shown by a call to gregexpr, but the replacement string is clearly not working as desired.

If you can't already tell, I'm relatively new to regexes and would appreciate help on how to get the replacement to work correctly. I'd also like to learn how to structure the regex so as to avoid capturing a letter after an apostrophe, since I don't want to change the case of those letters.

eipi10 asked Apr 03 '13 00:04


1 Answer

The main problem is that you're missing perl=TRUE (and your regex is slightly wrong, although that may be a result of flailing around to try to fix the first problem).
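
To make both problems concrete, here is a small sketch that reuses the question's own pattern and data. It shows why the toupper("\\1") attempt changes nothing (the call is evaluated before gsub ever sees it) and what the original pattern does once perl=TRUE is switched on. The name shouted is only an illustrative variable.

```r
strings <- c("first phrase", "another phrase to convert",
             "and here's another one", "last-one")

# toupper() runs before gsub() is called, and "\\1" contains no letters,
# so the replacement gsub() receives is still the plain backreference.
identical(toupper("\\1"), "\\1")
## [1] TRUE

# With perl=TRUE, \\U is honoured, but the question's pattern captures the
# whole word, so every letter is upper-cased rather than just the first.
shouted <- gsub("(\\b[a-z]+)", "\\U\\1", strings, perl=TRUE)
shouted
## [1] "FIRST PHRASE"              "ANOTHER PHRASE TO CONVERT"
## [3] "AND HERE'S ANOTHER ONE"    "LAST-ONE"
```

Splitting the word into a first-letter group and a rest-of-word group is what lets the replacement upper-case only the first letter.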

Using [:lower:] instead of [a-z] is slightly safer in case your code ends up being run in some weird (sorry, Estonians) locale where z is not the last letter of the alphabet ...

re_from <- "\\b([[:lower:]])([[:lower:]]+)"
strings <- c("first phrase", "another phrase to convert",
             "and here's another one", "last-one")
gsub(re_from, "\\U\\1\\L\\2", strings, perl=TRUE)
## [1] "First Phrase"              "Another Phrase To Convert"
## [3] "And Here's Another One"    "Last-One"    
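
On the apostrophe question: re_from leaves the "s" in "here's" alone only because it requires at least two lower-case letters, so longer contractions and one-letter words behave differently. As a rough sketch (not the only way to do it), a negative lookbehind can skip letters that directly follow an apostrophe. The pattern name re_apos and the second test string are assumptions made for illustration.

```r
re_from <- "\\b([[:lower:]])([[:lower:]]+)"
# Hypothetical alternative: a lower-case letter at a word boundary
# that is not immediately preceded by an apostrophe.
re_apos <- "(?<!')\\b([[:lower:]])"

x <- c("and here's another one", "they're a team")
with_re_from <- gsub(re_from, "\\U\\1\\L\\2", x, perl=TRUE)
with_re_apos <- gsub(re_apos, "\\U\\1", x, perl=TRUE)
with_re_from
## [1] "And Here's Another One" "They'Re a Team"
with_re_apos
## [1] "And Here's Another One" "They're A Team"
```

One caveat with the lookbehind version: a word that opens with a single quote (a quoted word, or something like 'tis) is also left in lower case, so it may need adjusting for text that uses single quotes as quotation marks.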

You may prefer to use \\E (stop capitalization) rather than \\L (start lowercase), depending on what rules you want to follow, e.g.:

string2 <- "using AIC for model selection"
gsub(re_from, "\\U\\1\\E\\2", string2, perl=TRUE)
## [1] "Using AIC For Model Selection"
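
Since re_from only matches runs of lower-case letters, its second group is already lower case, so \\L and \\E should give the same result with it. The choice starts to matter with a pattern that also matches capitals. A minimal sketch, assuming a hypothetical pattern re_any that accepts letters of either case:

```r
string2 <- "using AIC for model selection"
# Hypothetical pattern: first letter of any case, then the rest of the word
re_any <- "\\b([[:alpha:]])([[:alpha:]]*)"

# \\L forces the rest of each word to lower case
lowered <- gsub(re_any, "\\U\\1\\L\\2", string2, perl=TRUE)
# \\E ends case conversion, so the rest of each word is left as written
kept <- gsub(re_any, "\\U\\1\\E\\2", string2, perl=TRUE)
lowered
## [1] "Using Aic For Model Selection"
kept
## [1] "Using AIC For Model Selection"
```

So \\L suits strict title case, while \\E is the safer pick when the input may contain acronyms that should survive.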
Ben Bolker answered Sep 22 '22 08:09
