Using str_replace_all or gsub (regex) to replace each occurrence of substring with unique substring - R

I have a column in a dataframe with street addresses like this:

df_col <- c("100 W 10th St", "200 Drury Ln 2a", "300 W 10th St", "400b Drury Ln")

I want to capitalize any single lower-case letters that immediately follow digits like this:

df_col <- c("100 W 10th St", "200 Drury Ln 2A", "300 W 10th St", "400B Drury Ln")

I have been able to use str_detect from the stringr package to detect substrings with digits followed by a single lower-case letter:

df %>% 
    filter(str_detect(df_col, "\\b\\d+[a-z]\\b")) 

This is my first time writing regex, so here is how I understand the pattern:

  • \\b matches the boundary of a word (or substring)

  • \\d matches any digit and the + is to match additional digits that follow the first digit if applicable

  • [a-z] matches one lower-case letter (any letter)
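To check that reading of the pattern, here is a minimal sketch that runs it against the sample vector directly. It assumes stringr is installed; the variable names `pattern`, `detected` and `extracted` are only illustrative.

```r
library(stringr)

df_col <- c("100 W 10th St", "200 Drury Ln 2a", "300 W 10th St", "400b Drury Ln")

# Whole word made of one or more digits followed by exactly one lower-case letter
pattern <- "\\b\\d+[a-z]\\b"

# Which addresses contain such a word?
detected <- str_detect(df_col, pattern)
print(detected)
# [1] FALSE  TRUE FALSE  TRUE

# What exactly was matched? NA means no match in that element.
extracted <- str_extract(df_col, pattern)
print(extracted)
```

"10th" is not matched because two letters follow the digits, so there is no word boundary right after the first letter.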

However, I am struggling to figure out how to replace each of these substrings with the same substring but a capitalized letter.

I have tried using str_replace_all, but I cannot figure out the third argument. I thought I could do something like this, but it is replacing each substring with the literal regex.

df %>% 
    mutate(df_col = str_replace_all(df_col, "\\b\\d+[a-z]\\b", "\\b\\d+[A-Z]\\b"))

I tried using gsub with mutate but could not figure that out either. I would prefer to learn a solution for str_replace_all, but other ways of solving the problem are welcome.

Adam F asked Sep 03 '25 03:09


2 Answers

Aiming for a simpler solution: this matches a digit followed by a single word character and a word boundary, and runs toupper() on the match to capitalize it. Since toupper() has no effect on the numeric part of the string, we don't have to worry about lookahead/lookbehind or anything more complicated.

library(stringr)
str_replace_all(
    df_col,
    pattern = "\\d\\w\\b",
    replacement = toupper
  )
# [1] "100 W 10th St"   "200 Drury Ln 2A" "300 W 10th St"   "400B Drury Ln"  
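The same idea should also work inside mutate() with the stricter pattern from the question. This is a sketch, assuming dplyr is installed and a stringr version that accepts a function as the replacement; the data frame `df` is rebuilt here from the sample vector and `df_fixed` is an illustrative name.

```r
library(dplyr)
library(stringr)

# Rebuild the example data frame from the question's sample vector
df <- tibble(
  df_col = c("100 W 10th St", "200 Drury Ln 2a", "300 W 10th St", "400b Drury Ln")
)

# Pass a function, not a second regex, as the replacement:
# each matched substring is handed to toupper() and replaced by the result
df_fixed <- df %>%
  mutate(df_col = str_replace_all(df_col, "\\b\\d+[a-z]\\b", toupper))

print(df_fixed$df_col)
```

A replacement string is treated as text to insert (apart from backreferences such as \\1), which is why the attempt in the question inserted the pattern itself.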
Gregor Thomas answered Sep 04 '25 23:09


You can use gsub like this:

gsub("(\\d\\p{Ll})\\b", "\\U\\1", df_col, perl=TRUE)
## Or, if you must ensure the matches are whole words starting with a number and then a letter:
gsub("\\b(\\d+\\p{Ll})\\b", "\\U\\1", df_col, perl=TRUE)

See the online R demo.

Details

  • (\d\p{Ll})\b matches and captures into Group 1 a digit (\d) followed by a lowercase letter (\p{Ll}) that is not itself followed by a letter, digit or underscore (\b)
  • perl=TRUE enables the \U operator in the replacement pattern (which converts the replacement text to upper case) and also the use of Unicode category classes in the regex (like \p{X})
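To see where the two patterns differ, here is a small base R sketch. On the sample addresses both give the same result; the extra input "ab2c" is a made-up string, not from the question, used only to show the effect of the leading \b. The names `loose`, `strict` and `res_*` are illustrative.

```r
df_col <- c("100 W 10th St", "200 Drury Ln 2a", "300 W 10th St", "400b Drury Ln")

loose  <- "(\\d\\p{Ll})\\b"       # digit + lowercase letter at the end of a word
strict <- "\\b(\\d+\\p{Ll})\\b"   # whole word: digits then one lowercase letter

# \\U upper-cases the captured group; this needs perl = TRUE
res_loose  <- gsub(loose,  "\\U\\1", df_col, perl = TRUE)
res_strict <- gsub(strict, "\\U\\1", df_col, perl = TRUE)
print(res_loose)
print(identical(res_loose, res_strict))

# Hypothetical input where the word starts with letters rather than a digit
odd_loose  <- gsub(loose,  "\\U\\1", "ab2c", perl = TRUE)  # "ab2C"
odd_strict <- gsub(strict, "\\U\\1", "ab2c", perl = TRUE)  # "ab2c" (unchanged)
print(c(odd_loose, odd_strict))
```

The loose pattern only looks at the last two characters of a word, so it also fires inside words that do not start with a digit; the strict one requires the whole word to be digits plus one letter.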
Wiktor Stribiżew answered Sep 04 '25 21:09
