I have a corrupted file where Windows-Special Characters have been replaced by their UTF-8 "equivalents". I tried to write a function that is able to replace the special characters based on this table:
utf2win <- function(x){
soll <- c("À", "Á", "Â", "Ã", "Ä", "Å", "Æ", "Ç", "È", "É", "Ê", "Ë",
"Ì", "Í", "Î", "Ï", "Ð", "Ñ", "Ò", "Ó", "Ô", "Õ", "Ö", "×", "Ø",
"Ù", "Ú", "Û", "Ü", "Ý", "Þ", "ß", "à", "á", "â", "ã", "ä", "å",
"æ", "ç", "è", "é", "ê", "ë", "ì", "í", "î", "ï", "ð", "ñ", "ò",
"ó", "ô", "õ", "ö", "÷", "ø", "ù", "ú", "û", "ü", "ý", "þ", "ÿ"
)
ist <- c("À", "Ã", "Â", "Ã", "Ä", "Ã…", "Æ", "Ç", "È", "É",
"Ê", "Ë", "ÃŒ", "Ã", "ÃŽ", "Ã", "Ã", "Ñ", "Ã’", "Ó", "Ô",
"Õ", "Ö", "×", "Ø", "Ù", "Ú", "Û", "Ãœ", "Ã", "Þ", "ß",
"Ã", "á", "â", "ã", "ä", "Ã¥", "æ", "ç", "è", "é", "ê",
"ë", "ì", "Ã", "î", "ï", "ð", "ñ", "ò", "ó", "ô", "õ",
"ö", "÷", "ø", "ù", "ú", "û", "ü", "ý", "þ", "ÿ")
for(i in 1: length(ist)){
x <- gsub(ist[i], soll[i], x)
}
return(x)
}
And now for a test
a <- "Geidorf: Grabengürtel"
utf2win(a)
And nothing happens... I guess the issue is that the character "Ã" is not recognized propperly. Do you have a solution for my problem?
You can either use R base function gsub() or use str_replace() from stringr package to remove characters from a string or text.
To remove a character in an R data frame column, we can use gsub function which will replace the character with blank. For example, if we have a data frame called df that contains a character column say x which has a character ID in each value then it can be removed by using the command gsub("ID","",as.
This is an encoding problem. You may be able to fix it, but it's hard to know without the file. readBin
is a good bet if you can't force the proper encoding. Here is a summary of what I found:
I tried iconv
for the example string
iconv(a, "UTF-8", "WINDOWS-1252")
#[1] "Geidorf: Grabengürtel"
And it works, but you are right that something is up with "Ã"
iconv("Geidorf: Grabengürtel Ã", "UTF-8", "WINDOWS-1252")
#[1] NA
We can see which letters are problematic:
ist[is.na(iconv(ist, "UTF-8", "WINDOWS-1252"))]
[1] "Ã" "Ã" "Ã" "Ã" "Ã" "Ã"
# corresponding characters
paste(soll[is.na(iconv(ist, "UTF-8", "WINDOWS-1252"))])
[1] "Á" "Í" "Ï" "Ð" "Ý" "à"
The site you linked to has a relevant page, which spells out what the issue is:
Encoding Problem: Double Mis-Conversion
Symptom
With this particular double conversion, most characters display correctly. Only characters with a second UTF-8 byte of 0x81, 0x8D, 0x8F, 0x90, 0x9D fail. In Windows-1252, the following characters with the Unicode code points: U+00C1, U+00CD, U+00CF, U+00D0, and U+00DD will show the problem. If you look at the I18nQA Encoding Debug Table you can see that these characters in UTF-8 have second bytes ending in one of the Unassigned Windows code points.
Á Í Ï Ð Ý
"à" is a different case. You have mapped it to "Ã" when it should be "Ã\u00A0" or "Ã\xA0" or "Ã " (note that the space is not a normal space; it's a non-breaking space). So, fixing that in ist
takes care of one letter.
As for the other characters ("Á", "Í", "Ï", "Ð", and "Ý"), as is, they are all mapped to "Ã" in ist
, and you'll never be able to do the appropriate substitutions as long as that's true.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With