I have a data.table called base, with a terme column in it.
    class(base$terme)
    [1] "character"
    length(base$terme)
    [1] 27486
I can remove accents from a single string, and from a vector of strings:
    iconv("Millésime",to="ASCII//TRANSLIT")
    [1] "Millesime"
    iconv(c("Millésime","boulangère"),to="ASCII//TRANSLIT")
    [1] "Millesime"  "boulangere"
But for some reason, it does not work when I apply the very same function to my terme column:
    base$terme[2]
    [1] "Millésime"
    iconv(base$terme[2],to="ASCII//TRANSLIT")
    [1] "MillACsime"
Does anybody know what is going on here?
To remove a character from a column of an R data frame, we can use the gsub function, which replaces the character with an empty string. For example, if a data frame called df has a character column x in which every value contains the string ID, it can be removed with a command of the form gsub("ID","",as…
In Python, accents can be removed from a string with the Unidecode module. It provides a function that takes a Unicode string and returns a string without accents.
In Java, use java.text.Normalizer to handle this for you. It separates the accent marks from the base characters.
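The same decompose-then-strip idea can be tried in R. This is only a sketch, assuming the stringi package is installed; the vector x is a made-up example, not the asker's data. It decomposes each accented letter into a base letter plus a combining mark (NFD), then deletes the combining marks.

```r
library(stringi)

# Made-up example vector; \u escapes avoid depending on the file's encoding
x <- c("Mill\u00e9sime", "boulang\u00e8re")

# Step 1: canonical decomposition, e.g. "é" becomes "e" + U+0301 (combining acute)
decomposed <- stri_trans_nfd(x)

# Step 2: remove nonspacing marks (Unicode category Mn), leaving the base letters
stripped <- stri_replace_all_regex(decomposed, "\\p{Mn}", "")

print(stripped)
```

Unlike a Latin-ASCII transliteration, this approach only removes combining marks, so letters with no canonical decomposition, such as ø or ß, are left unchanged.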
It might be easier to use the stringi package. This way, you don't need to check the encoding beforehand. Furthermore, stringi is consistent across operating systems, whereas iconv is not.
    library(data.table)
    library(stringi)
    base <- data.table(terme = c("Millésime", "boulangère", "üéâäàåçêëèïîì"))
    base[, terme := stri_trans_general(str = terme, id = "Latin-ASCII")]
    > base
               terme
    1:     Millesime
    2:    boulangere
    3: ueaaaaceeeiii
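If you would rather keep iconv, the encoding check mentioned above might look like the sketch below. It assumes the column holds UTF-8 bytes that iconv is reading as the platform's native encoding. That would be consistent with the "MillACsime" output, since é is two bytes in UTF-8, but the question does not confirm it. The vector x is a stand-in for base$terme.

```r
# Stand-in for base$terme (assumption: the real column is UTF-8 encoded).
# \u escapes make R mark these strings as UTF-8 regardless of the file encoding.
x <- c("Mill\u00e9sime", "boulang\u00e8re")

# Encoding() reports the declared encoding mark of each element
enc <- Encoding(x)
print(enc)

# State the input encoding explicitly instead of letting iconv assume
# the native encoding of the current locale
ascii <- iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")
print(ascii)
```

The exact transliteration produced by ASCII//TRANSLIT depends on the platform's iconv implementation, so no output is shown here.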