
Lower Case Certain Words R

Tags: regex, r

I need to convert certain words to lower case. I am working with a list of movie titles, where prepositions and articles are normally lower case if they are not the first word in the title. If I have the vector:

movies = c('The Kings Of Summer', 'The Words', 'Out Of The Furnace', 'Me And Earl And The Dying Girl')

What I need is this:

movies_updated = c('The Kings of Summer', 'The Words', 'Out of the Furnace', 'Me and Earl and the Dying Girl')

Is there an elegant way to do this without using a long series of gsub(), as in:

movies_updated = gsub(' In ', ' in ', movies)
movies_updated = gsub(' In', ' in', movies_updated)
movies_updated = gsub(' Of ', ' of ', movies_updated)
movies_updated = gsub(' Of', ' of', movies_updated)
movies_updated = gsub(' The ', ' the ', movies_updated)
movies_updated = gsub(' The', ' the', movies_updated)

And so on.

tsouchlarakis asked Nov 30 '22 22:11


2 Answers

In effect, you want to convert your text to title case. Note that stri_trans_totitle from the stringi package capitalises every word, including articles and prepositions, so on its own it does not give the result you are after:

> stringi::stri_trans_totitle(c('The Kings of Summer', 'The Words', 'Out of the Furnace'))
[1] "The Kings Of Summer" "The Words"           "Out Of The Furnace"

An alternative is the toTitleCase function from the tools package, which keeps those small words in lower case:

> tools::toTitleCase(c('The Kings of Summer', 'The Words', 'Out of the Furnace'))
[1] "The Kings of Summer" "The Words"           "Out of the Furnace" 
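
The call above starts from titles that already have lower-case small words, so it is worth running toTitleCase on the vector from the question as well. The sketch below does that and compares the result with the target vector. As far as I can tell, toTitleCase lower-cases a built-in list of small English words (including of, the and and) when they are not the first word, but that is an assumption about its internals that you should confirm on your R version; the name movies_title is only illustrative.

```r
# Titles as given in the question, with every word capitalised.
movies <- c('The Kings Of Summer', 'The Words', 'Out Of The Furnace',
            'Me And Earl And The Dying Girl')

# Target from the question.
movies_updated <- c('The Kings of Summer', 'The Words', 'Out of the Furnace',
                    'Me and Earl and the Dying Girl')

# tools ships with base R, so no extra installation is needed.
movies_title <- tools::toTitleCase(movies)
print(movies_title)

# TRUE only if every title matches the target exactly.
print(identical(movies_title, movies_updated))
```

Keep in mind that the list of words is fixed inside toTitleCase, so this route gives no control over which words are treated as small.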

Konrad answered Dec 02 '22 11:12


Though I like @Konrad's answer for its succinctness, I'll offer an alternative that is more literal and manual.

movies = c('The Kings Of Summer', 'The Words', 'Out Of The Furnace',
           'Me And Earl And The Dying Girl')

gr <- gregexpr("(?<!^)\\b(of|in|the)\\b", movies, ignore.case = TRUE, perl = TRUE)
mat <- regmatches(movies, gr)
regmatches(movies, gr) <- lapply(mat, tolower)
movies
# [1] "The Kings of Summer"            "The Words"                     
# [3] "Out of the Furnace"             "Me And Earl And the Dying Girl"

The tricks of the regular expression:

  • (?<!^) ensures we don't match a word at the beginning of a string. Without this, the first The of movies 1 and 2 will be down-cased.
  • \\b sets up word-boundaries, such that in in the middle of Dying will not match. This is slightly more robust than your use of space, since hyphens, commas, etc, will not be spaces but do indicate the beginning/end of a word.
  • (of|in|the) matches any one of of, in, or the. More patterns can be added with separating pipes |.
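
To see what each piece contributes, here is a small sketch that runs the full pattern next to two weakened versions, one without the lookbehind and one without the word boundaries. The helper find_matches and the sample titles are only illustrative and not part of the answer's code.

```r
titles <- c("The Words", "Out Of The Furnace", "The Dying Girl")

# Hypothetical helper: list the matches of `pattern` in each title.
find_matches <- function(pattern, x) {
  regmatches(x, gregexpr(pattern, x, ignore.case = TRUE, perl = TRUE))
}

# Full pattern: only non-initial whole words are matched.
full <- find_matches("(?<!^)\\b(of|in|the)\\b", titles)

# No lookbehind: the leading "The" is matched as well.
no_lookbehind <- find_matches("\\b(of|in|the)\\b", titles)

# No word boundaries: the "in" inside "Dying" is matched as well.
no_boundaries <- find_matches("(?<!^)(of|in|the)", titles)

print(full)
print(no_lookbehind)
print(no_boundaries)
```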

Once identified, it's as simple as replacing them with down-cased versions.
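
If you need this more than once, the same idea can be wrapped in a function that takes the word list as an argument. The sketch below is one possible variant and swaps the regmatches() assignment for a single gsub() call, relying on the "\\L" case conversion that gsub() supports in the replacement when perl = TRUE. The function name lower_small_words and its default word list (which adds and so that the question's last title comes out as requested) are assumptions for illustration.

```r
# Hypothetical helper: lower-case the given small words unless they
# are the first word of the title.
lower_small_words <- function(titles, words = c("and", "in", "of", "the")) {
  # Build "(?<!^)\\b(and|in|of|the)\\b" from the word list.
  pattern <- paste0("(?<!^)\\b(", paste(words, collapse = "|"), ")\\b")
  # "\\L\\1" re-emits the matched word in lower case (perl = TRUE only).
  gsub(pattern, "\\L\\1", titles, ignore.case = TRUE, perl = TRUE)
}

movies <- c('The Kings Of Summer', 'The Words', 'Out Of The Furnace',
            'Me And Earl And The Dying Girl')

movies_updated <- lower_small_words(movies)
print(movies_updated)
```

Passing a longer vector as words (for example adding "a", "an" or "to") extends the rule without touching the pattern by hand.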

r2evans answered Dec 02 '22 11:12