
Remove accents from a dataframe column in R

Tags: r, diacritics

I have a data.table named base with a character column, terme:

    class(base$terme)
    [1] "character"
    length(base$terme)
    [1] 27486

I can remove accents from a single string, and from a vector of strings:

    iconv("Millésime",to="ASCII//TRANSLIT")
    [1] "Millesime"
    iconv(c("Millésime","boulangère"),to="ASCII//TRANSLIT")
    [1] "Millesime" "boulangere"

But for some reason, it does not work when I apply the very same function to my terme column:

    base$terme[2]
    [1] "Millésime"
    iconv(base$terme[2],to="ASCII//TRANSLIT")
    [1] "MillACsime"

Does anybody know what is going on here?

hans glick asked Aug 25 '16 15:08




1 Answer

It might be easier to use the stringi package. That way, you don't need to check the encoding beforehand. Furthermore, stringi behaves consistently across operating systems, whereas iconv does not.

    library(data.table)
    library(stringi)

    base <- data.table(terme = c("Millésime",
                                 "boulangère",
                                 "üéâäàåçêëèïîì"))

    # Transliterate accented Latin characters to their closest ASCII equivalents
    base[, terme := stri_trans_general(str = terme,
                                       id = "Latin-ASCII")]

    > base
               terme
    1:     Millesime
    2:    boulangere
    3: ueaaaaceeeiii
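For comparison, here is a minimal base R sketch of the encoding check that stringi makes unnecessary. It assumes the column holds UTF-8 text that iconv is reading in the locale's native encoding, which the question does not confirm. The vector terme is a hypothetical stand-in for base$terme.

```r
# Hypothetical stand-in for base$terme. "\u00e9" and "\u00e8" are the
# escapes for e-acute and e-grave, so the strings are UTF-8 whatever the
# encoding of this script file.
terme <- c("Mill\u00e9sime", "boulang\u00e8re")

# Inspect the declared encoding of each element
# (one of "unknown", "latin1", "UTF-8" or "bytes").
print(Encoding(terme))

# Name the source encoding explicitly so that iconv does not fall back to
# the native encoding of the current locale (the default, from = "").
terme_ascii <- iconv(terme, from = "UTF-8", to = "ASCII//TRANSLIT")

# The exact transliteration depends on the platform's iconv implementation.
print(terme_ascii)
```

If the assumption holds for the data.table in the question, the same call could be applied in place with base[, terme := iconv(terme, from = "UTF-8", to = "ASCII//TRANSLIT")]. Because the result of //TRANSLIT varies between iconv implementations, the stringi version remains the more portable choice.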
Jeldrik answered Sep 21 '22 19:09
