
Replacing all umlauts simultaneously in R (using regex)

Tags: regex, r

I have text in German and I want to replace all umlauts (ä, Ä, ö, Ö, ü, Ü) with their ASCII digraphs (ae, Ae, oe, Oe, ue, Ue).

I can do it separately (by saving each substitution into a new file):

gsub(pattern = '[ä]', replacement = "ae", text)
gsub(pattern = '[ü]', replacement = "ue", text)
gsub(pattern = '[ö]', replacement = "oe", text)

But can I do it in one command (including substituting capital letters with Ae, Oe and Ue, etc.)?

Can I do it by regex?

asked Oct 29 '16 20:10

Daniel Yefimov


2 Answers

Some of the solutions here may or may not work, depending on the locale of the OS running R and the encoding of the input string. I have run into this problem many times on different operating systems and with different language settings. I currently develop in R on German Windows 10 but sometimes run the code on an English Ubuntu VM.

A very fast and reliable solution that works under both Windows and Ubuntu, with both de_DE and en_US locales, is the one from https://github.com/gagolews/stringi/issues/269#issuecomment-488623874

> stringi::stri_trans_general("ä ö ü ß", "de-ASCII; Latin-ASCII")
[1] "ae oe ue ss"

The ; inside the ICU transform ID makes it a 'compound ID': the listed transforms are applied in sequence, from left to right. See ?stri_trans_general for more info.
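
As a small sketch of how the compound ID could be applied to a whole character vector: this assumes the stringi package is installed and that its ICU build provides the de-ASCII transform, which may not hold for older ICU versions. The sample words are made up for illustration, and the input is written with \u escapes so the script does not depend on the encoding of the source file.

```r
library(stringi)

# Sample input (illustrative only), written with Unicode escapes:
# "ä ö ü ß", "Ärger", "Straße", "Übung"
x <- c("\u00e4 \u00f6 \u00fc \u00df", "\u00c4rger", "Stra\u00dfe", "\u00dcbung")

# Compound transform ID: the German-specific rules run first,
# then Latin-ASCII handles any remaining non-ASCII Latin characters.
out <- stri_trans_general(x, "de-ASCII; Latin-ASCII")
print(out)

# Sanity check: every element of the result should now be plain ASCII.
stopifnot(all(stri_enc_isascii(out)))
```

The check at the end only confirms that nothing non-ASCII is left over; how capital umlauts are cased (for example Ae versus AE) is decided by the ICU rules, so it is worth inspecting the printed output on your own system.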

answered Nov 07 '22 06:11

zerweck


You could try

# install.packages("stringi") # uncomment & run if needed
str <- c("äöü", "ÄÖÜ")
stringi::stri_replace_all_fixed(
  str, 
  c("ä", "ö", "ü", "Ä", "Ö", "Ü"), 
  c("ae", "oe", "ue", "Ae", "Oe", "Ue"), 
  vectorize_all = FALSE
)
# [1] "aeoeue" "AeOeUe"

answered Nov 07 '22 06:11

lukeA
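
For comparison, the same replacement table can also be applied in base R without any package, by looping over the pairs with gsub(..., fixed = TRUE). This is only a minimal sketch: replace_umlauts() is a hypothetical helper name, not part of any library, and the umlauts are written as \u escapes on the assumption that the input strings are UTF-8.

```r
# Parallel vectors: umlauts[i] is replaced by ascii[i].
# \u escapes keep the script independent of the source file encoding.
umlauts <- c("\u00e4", "\u00f6", "\u00fc", "\u00c4", "\u00d6", "\u00dc")
ascii   <- c("ae", "oe", "ue", "Ae", "Oe", "Ue")

# Hypothetical helper: applies each substitution in turn to the
# result of the previous one, so all umlauts end up replaced.
replace_umlauts <- function(x, from = umlauts, to = ascii) {
  for (i in seq_along(from)) {
    # fixed = TRUE treats the pattern as a literal string, not a regex
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

replace_umlauts(c("\u00e4\u00f6\u00fc", "\u00c4\u00d6\u00dc"))
# [1] "aeoeue" "AeOeUe"
```

The loop feeds each result into the next substitution, which is the step missing when the three gsub calls from the question are run independently on the original text.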