What is the optimal way to remove German (or French) accents from a vector of 16 million strings?
e.g., 'Sjögren's syndrome' into 'Sjogren's syndrome'
Conversion of a single character into a single character is better than transliteration such as
ä => ae ö => oe ü => ue.
Using regular expressions would be one option, but is there something better (an R package for this)?
gsub('ü','u',gsub('ö','o',"Sjögren's syndrome ( über) "))
There are SO solutions for non-R platforms but not a good one for R.
For example, one JavaScript approach strips every non-alphanumeric character with replace(/[^a-z0-9]/gi,''). However, a more intuitive solution (at least for the user) would be to replace accented characters with their "plain" equivalent, e.g. turn á into a and ç into c, etc.
In extended ASCII encodings, the first 128 characters are the same as in ASCII, and the rest are usually used for accented letters such as É, È, Î and Ü. This solves the problem for a few languages that are based on the Latin alphabet, although not all extended ASCII systems are the same.
In ASCII, uppercase letters have the values 65 to 90 and lowercase letters have the values 97 to 122.
string = string.replaceAll("[^\\p{ASCII}]", "");
Use iconv to convert to ASCII with transliteration (if supported):
iconv(c("über","Sjögren's"),to="ASCII//TRANSLIT")
[1] "uber" "Sjogren's"
One of the linked answers suggests
library(stringi)
stri_trans_general("Zażółć gęślą jaźń", "Latin-ASCII")
[1] "Zazolc gesla jazn"
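Both answers above rely on transliteration tables, whose output can depend on the platform or on an extra package. If only a known set of characters has to be mapped one-to-one, as the question asks, base R's chartr() may be enough. The following is a minimal sketch, not a complete solution: the helper name strip_accents and the character list are assumptions made for illustration, and the list would need extending for other languages. Characters are written as \u escapes so the example does not depend on the file encoding.

```r
# Hypothetical helper (not from the answers above): replaces each accented
# character with exactly one unaccented character using base R's chartr().
strip_accents <- function(x) {
  # The n-th character of `from` is replaced by the n-th character of `to`,
  # so both strings must contain the same number of characters.
  # from: ä ö ü Ä Ö Ü é è ê à â ç
  from <- "\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc\u00e9\u00e8\u00ea\u00e0\u00e2\u00e7"
  to   <- "aouAOUeeeaac"
  chartr(from, to, x)
}

x <- c("Sj\u00f6gren's syndrome", "\u00fcber", "Sj\u00f6gren's syndrome")

# With many repeated strings, converting only the unique values and
# mapping the results back avoids redundant work.
u <- unique(x)
cleaned <- strip_accents(u)[match(x, u)]
print(cleaned)
```

Because chartr() only maps one character to one character, it cannot handle a case like ß to ss; such characters would need a separate gsub() call or one of the transliteration approaches above.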