
Convert accented characters into ASCII characters

What is the optimal way to remove German (or French) accents from a vector of 16 million strings?

e.g., 'Sjögren's syndrome' into 'Sjogren's syndrome'

Converting each accented character into a single unaccented character is preferable to multi-character transliteration such as

ä => ae ö => oe ü => ue.

One option is to use regular expressions, as in the call below, but is there something better (e.g., an R package for this)?

gsub('ü','u',gsub('ö','o',"Sjögren's syndrome ( über) "))

There are SO solutions for non-R platforms, but no good one for R.

asked Nov 28 '12 16:11 by userJT



2 Answers

Use iconv to convert to ASCII with transliteration (if supported):

iconv(c("über","Sjögren's"),to="ASCII//TRANSLIT")
[1] "uber"      "Sjogren's"
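The "if supported" caveat matters because the result of //TRANSLIT depends on the platform's iconv implementation. If a strict one-character-to-one-character mapping is wanted regardless of platform, base R's chartr() may be enough. The following is a minimal sketch, assuming a UTF-8 session; the two mapping strings are illustrative, not exhaustive, and must be extended in parallel so that they stay the same length.

```r
# Illustrative (hypothetical) mapping: position i in `accented`
# is replaced by position i in `plain`. Both must have equal length.
accented <- "äöüÄÖÜéèêàâçñ"
plain    <- "aouAOUeeeaacn"

x <- c("über", "Sjögren's syndrome")

# chartr() is vectorised over x and never changes string length.
result <- chartr(accented, plain, x)
print(result)
```

Compared with nested gsub() calls, this makes a single pass over the vector and keeps the whole mapping in one place.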
answered Oct 25 '22 14:10 by James


One of the linked answers suggests:

library(stringi)
stri_trans_general("Zażółć gęślą jaźń", "Latin-ASCII")
[1] "Zazolc gesla jazn"
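stri_trans_general() is vectorised, so it can be applied to the whole vector in one call. For a vector on the scale mentioned in the question, it may also help to transliterate each distinct value only once and map the results back. This is a sketch under the assumption that many values repeat, which the question does not state; the sample vector is made up for illustration.

```r
library(stringi)

# Made-up sample data standing in for the large vector.
x <- c("Sjögren's syndrome", "über", "Sjögren's syndrome")

# Transliterate each distinct string once, then map the results
# back to the original positions with match().
ux  <- unique(x)
out <- stri_trans_general(ux, "Latin-ASCII")[match(x, ux)]
print(out)
```

If most values are unique, the deduplication step gains nothing and a direct call on the full vector is simpler.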
answered Oct 25 '22 14:10 by userJT