Capitalizing text in R with an exception list

Tags: regex, r

*Apologies, I should've been clearer (I really appreciate all the help, though!)

I extract a .csv file from a database. The file contains a list of place names. I apply INITCAP during the extract, so they all come out in proper mixed case. However, some of these place names need to stay fully capitalized because they are known abbreviations (universities, etc.). In the end I will load the corrected names back into the database.

I'm new to R and stuck on a bit of a problem. I'm extracting data that is all in capitals, but I need it in proper case, i.e. "THIS IS ALL CAPS" should become "This Is All Caps", and I need to be able to exclude certain words. Things like "FYI" and other abbreviations need to remain capitalized. I've managed to solve part of my issue with the lettercase library, particularly str_ucfirst. My only remaining issue is the exception part. Any suggestions would be appreciated. Thanks.

asked Apr 19 '18 14:04 by kill9all



1 Answer

Building on @akrun's (now deleted) solution, you could build a vector of exceptions and collapse it with paste0 into a regular expression that uses (*SKIP)(*FAIL):

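string <- "THIS IS ALL CAPS"
exceptions <- c("FYI", "THIS")
# Exception words are matched first and thrown away via (*SKIP)(*FAIL);
# every other word is captured as first letter (\\1) plus remainder (\\2).
pattern <- sprintf("(?:%s)(*SKIP)(*FAIL)|\\b([A-Z])(\\w+)", paste0(exceptions, collapse = "|"))
# \\L lowercases the remainder; perl = TRUE is needed for (*SKIP)(*FAIL) and \\L.
gsub(pattern, "\\1\\L\\2", string, perl = TRUE)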

Which yields

[1] "THIS Is All Caps"

Note that THIS was left untouched because it is in the exception vector.
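If this needs to run over a whole column of place names, the same idea can be wrapped in a small function. This is only a sketch: the name title_case_except and the sample data are made up for illustration, and it assumes the input is upper-case ASCII and that the exception words contain no regex metacharacters. It also adds \\b on both sides of the exception group, which the pattern above does not have, so that an exception such as "THIS" does not also catch the start of "THISTLE".

```r
# Hypothetical helper (not part of the original answer): title-case every
# word in x except the words listed in `exceptions`.
title_case_except <- function(x, exceptions) {
  # \\b on both sides keeps an exception from matching inside a longer word.
  skip <- sprintf("\\b(?:%s)\\b(*SKIP)(*FAIL)", paste0(exceptions, collapse = "|"))
  pattern <- paste0(skip, "|\\b([A-Z])(\\w+)")
  # gsub is vectorised, so x can be a whole character column.
  gsub(pattern, "\\1\\L\\2", x, perl = TRUE)
}

# Made-up sample data
places <- c("FYI THIS IS ALL CAPS", "MIT THISTLE STREET")
result <- title_case_except(places, c("FYI", "MIT"))
print(result)
# [1] "FYI This Is All Caps" "MIT Thistle Street"
```

Since gsub works on character vectors, the function could be applied directly to a column read from the .csv file, e.g. df$place <- title_case_except(df$place, exceptions), where df and place are placeholder names.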


The general shape of the pattern is
unimportant|not_important|(very important)

In regex engines that support the (*SKIP)(*FAIL) verbs (PCRE, which is why perl = TRUE is needed in R), this is written as

...(*SKIP)(*FAIL)|what_i_want_to_match

In this case

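\b      # a word boundary
([A-Z]) # a single uppercase letter (group 1)
(\w+)   # one or more word characters, [a-zA-Z0-9_]+ (group 2)

The two groups are then used in the replacement: \\1 keeps the first letter as it is and \\L\\2 lowercases the rest of the word.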
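To see the two halves separately, here is a small sketch (variable names are arbitrary, and the exception is word-bounded with \\b, which is an addition to the pattern above). The first call lists what the pattern actually matches, so the skipped word is simply absent; the second shows what the replacement does with the two groups.

```r
x <- "FYI THIS IS ALL CAPS"
pattern <- "\\bFYI\\b(*SKIP)(*FAIL)|\\b([A-Z])(\\w+)"

# Which words survive the skip? "FYI" is consumed by the first branch and
# then discarded, so it never shows up as a match.
matched <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
print(matched)
# [1] "THIS" "IS"   "ALL"  "CAPS"

# \\1 is emitted unchanged; \\L lowercases the rest of the replacement (\\2).
replaced <- gsub(pattern, "\\1\\L\\2", x, perl = TRUE)
print(replaced)
# [1] "FYI This Is All Caps"
```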

answered Sep 22 '22 14:09 by Jan