
Replacing special characters from different encodings in R

I have a corrupted file in which Windows special characters have been replaced by their UTF-8 "equivalents". I tried to write a function that replaces the special characters based on this table:

utf2win <- function(x){
soll <- c("À", "Á", "Â", "Ã", "Ä", "Å", "Æ", "Ç", "È", "É", "Ê", "Ë", 
  "Ì", "Í", "Î", "Ï", "Ð", "Ñ", "Ò", "Ó", "Ô", "Õ", "Ö", "×", "Ø", 
  "Ù", "Ú", "Û", "Ü", "Ý", "Þ", "ß", "à", "á", "â", "ã", "ä", "å", 
  "æ", "ç", "è", "é", "ê", "ë", "ì", "í", "î", "ï", "ð", "ñ", "ò", 
  "ó", "ô", "õ", "ö", "÷", "ø", "ù", "ú", "û", "ü", "ý", "þ", "ÿ"
)

ist <- c("Ã€", "Ã", "Ã‚", "Ãƒ", "Ã„", "Ã…", "Ã†", "Ã‡", "Ãˆ", "Ã‰", 
  "ÃŠ", "Ã‹", "ÃŒ", "Ã", "ÃŽ", "Ã", "Ã", "Ã‘", "Ã’", "Ã“", "Ã”", 
  "Ã•", "Ã–", "Ã\u2014", "Ã˜", "Ã™", "Ãš", "Ã›", "Ãœ", "Ã", "Ãž", "ÃŸ", 
  "Ã", "Ã¡", "Ã¢", "Ã£", "Ã¤", "Ã¥", "Ã¦", "Ã§", "Ã¨", "Ã©", "Ãª", 
  "Ã«", "Ã¬", "Ã\u00AD", "Ã®", "Ã¯", "Ã°", "Ã±", "Ã²", "Ã³", "Ã´", "Ãµ", 
  "Ã¶", "Ã·", "Ã¸", "Ã¹", "Ãº", "Ã»", "Ã¼", "Ã½", "Ã¾", "Ã¿")


  for (i in seq_along(ist)) {
    x <- gsub(ist[i], soll[i], x)
  }
  return(x)
}

And now for a test

a <- "Geidorf: GrabengÃ¼rtel"

utf2win(a)

And nothing happens... I guess the issue is that the character "Ã" is not recognized properly. Do you have a solution for my problem?

Seb asked Jan 08 '16 15:01



1 Answer

This is an encoding problem. You may be able to fix it, but it's hard to know without the file. readBin is a good bet if you can't force the proper encoding. Here is a summary of what I found:

I tried iconv on the example string:

iconv(a, "UTF-8", "WINDOWS-1252")
#[1] "Geidorf: Grabengürtel"

And it works, but you are right that something is up with "Ã"

iconv("Geidorf: GrabengÃ¼rtel Ã", "UTF-8", "WINDOWS-1252")
#[1] NA

We can see which letters are problematic:

ist[is.na(iconv(ist, "UTF-8", "WINDOWS-1252"))]
[1] "Ã" "Ã" "Ã" "Ã" "Ã" "Ã"

# corresponding characters
paste(soll[is.na(iconv(ist, "UTF-8", "WINDOWS-1252"))])
[1] "Á" "Í" "Ï" "Ð" "Ý" "à"

The site you linked to has a relevant page, which spells out what the issue is:

Encoding Problem: Double Mis-Conversion

Symptom

With this particular double conversion, most characters display correctly. Only characters with a second UTF-8 byte of 0x81, 0x8D, 0x8F, 0x90, 0x9D fail. In Windows-1252, the following characters with the Unicode code points: U+00C1, U+00CD, U+00CF, U+00D0, and U+00DD will show the problem. If you look at the I18nQA Encoding Debug Table you can see that these characters in UTF-8 have second bytes ending in one of the Unassigned Windows code points.

Á Í Ï Ð Ý
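
To make the quoted explanation concrete, here is a small sketch that prints the second UTF-8 byte of each of the five characters. The variable names (chars, second_byte) are just for illustration, and the characters are written as \u escapes so the result does not depend on how the script file itself is encoded.

```r
# The five characters named in the quoted page: Á Í Ï Ð Ý
chars <- c("\u00C1", "\u00CD", "\u00CF", "\u00D0", "\u00DD")

# Each of them is two bytes long in UTF-8; take the second byte as upper-case hex
second_byte <- vapply(chars, function(ch) {
  toupper(as.character(charToRaw(enc2utf8(ch))[2]))
}, character(1), USE.NAMES = FALSE)

second_byte
# [1] "81" "8D" "8F" "90" "9D"
```

These are exactly the byte values that Windows-1252 leaves unassigned, which is why the second half of each pair has nothing to be displayed as.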


"à" is a different case. You have mapped it to "Ã" when it should be "Ã\u00A0" or "Ã\xA0" or "Ã " (note that the space is not a normal space; it's a non-breaking space). So, fixing that in ist takes care of one letter.
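
A quick way to check the non-breaking-space point is sketched below. It assumes your iconv build accepts the name "WINDOWS-1252" (as in the calls above); ist_a_grave and fixed are illustrative names only.

```r
# "Ã" followed by a non-breaking space (U+00A0), written with escapes
ist_a_grave <- "\u00C3\u00A0"

# Converting to Windows-1252 turns the two characters into the bytes C3 A0
fixed <- iconv(ist_a_grave, from = "UTF-8", to = "WINDOWS-1252")

# Those two bytes are the UTF-8 encoding of "à", so declare them as UTF-8
Encoding(fixed) <- "UTF-8"

charToRaw(fixed)   # the raw bytes c3 a0
fixed == "\u00E0"  # TRUE if the conversion behaved as assumed
```

With an ordinary space (0x20) after "Ã" the resulting bytes would be C3 20, which is not a valid UTF-8 sequence, so the distinction matters.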

As for the other characters ("Á", "Í", "Ï", "Ð", and "Ý"), as is, they are all mapped to "Ã" in ist, and you'll never be able to do the appropriate substitutions as long as that's true.
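
If the file still contains the unassigned bytes as invisible control characters (U+0081, U+008D, U+008F, U+0090, U+009D) rather than having dropped them, a byte-level round trip can recover all five letters. The following is only a sketch under that assumption, not something I could test against your file; cp1252_high and fix_mojibake are made-up names. If the second byte was lost entirely, nothing can tell the five letters apart.

```r
# Unicode code points that Windows-1252 assigns to bytes 0x80..0x9F
# (NA marks the five unassigned bytes 0x81, 0x8D, 0x8F, 0x90, 0x9D)
cp1252_high <- c(
  0x20AC, NA,     0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,  # 0x80-0x87
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, NA,     0x017D, NA,      # 0x88-0x8F
  NA,     0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,  # 0x90-0x97
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, NA,     0x017E, 0x0178   # 0x98-0x9F
)

# Hypothetical helper: undo one round of "UTF-8 read as Windows-1252"
# for a single string; returns the input unchanged if that is not possible.
fix_mojibake <- function(s) {
  cp <- utf8ToInt(enc2utf8(s))
  if (anyNA(cp)) return(s)

  # Map Windows-1252 specials back to their byte; everything else keeps
  # its code point (so U+0081 etc. pass through as bytes 0x81 etc.)
  idx <- match(cp, cp1252_high)
  bytes <- ifelse(!is.na(idx), 0x7F + idx, cp)
  if (any(bytes > 0xFF)) return(s)      # not representable as single bytes

  out <- rawToChar(as.raw(bytes))
  if (!validUTF8(out)) return(s)        # bytes do not form valid UTF-8
  Encoding(out) <- "UTF-8"
  out
}

fix_mojibake("Geidorf: Grabeng\u00C3\u00BCrtel")  # "Geidorf: Grabengürtel"
fix_mojibake("\u00C3\u0081")                      # "Á", if U+0081 survived
fix_mojibake("\u00C3")                            # lone "Ã" is left as it is
```

For a whole vector you could wrap it, for example vapply(x, fix_mojibake, character(1), USE.NAMES = FALSE). Strings that are already correct are returned untouched because their bytes do not form valid UTF-8 after the round trip.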

Jota answered Oct 09 '22 01:10