Replacing special characters from different encodings in R

Tags:

r

character-encoding

utf-8

windows-1252

I have a corrupted file in which Windows special characters have been replaced by their UTF-8 "equivalents". I tried to write a function that replaces the special characters based on this table:

utf2win <- function(x){
soll <- c("À", "Á", "Â", "Ã", "Ä", "Å", "Æ", "Ç", "È", "É", "Ê", "Ë", 
  "Ì", "Í", "Î", "Ï", "Ð", "Ñ", "Ò", "Ó", "Ô", "Õ", "Ö", "×", "Ø", 
  "Ù", "Ú", "Û", "Ü", "Ý", "Þ", "ß", "à", "á", "â", "ã", "ä", "å", 
  "æ", "ç", "è", "é", "ê", "ë", "ì", "í", "î", "ï", "ð", "ñ", "ò", 
  "ó", "ô", "õ", "ö", "÷", "ø", "ù", "ú", "û", "ü", "ý", "þ", "ÿ"
)

ist <- c("Ã€", "Ã", "Ã‚", "Ãƒ", "Ã„", "Ã…", "Ã†", "Ã‡", "Ãˆ", "Ã‰", 
  "ÃŠ", "Ã‹", "ÃŒ", "Ã", "ÃŽ", "Ã", "Ã", "Ã‘", "Ã’", "Ã“", "Ã”", 
  "Ã•", "Ã–", "Ã—", "Ã˜", "Ã™", "Ãš", "Ã›", "Ãœ", "Ã", "Ãž", "ÃŸ", 
  "Ã", "Ã¡", "Ã¢", "Ã£", "Ã¤", "Ã¥", "Ã¦", "Ã§", "Ã¨", "Ã©", "Ãª", 
  "Ã«", "Ã¬", "Ã", "Ã®", "Ã¯", "Ã°", "Ã±", "Ã²", "Ã³", "Ã´", "Ãµ", 
  "Ã¶", "Ã·", "Ã¸", "Ã¹", "Ãº", "Ã»", "Ã¼", "Ã½", "Ã¾", "Ã¿")


  for (i in 1:length(ist)) {
    x <- gsub(ist[i], soll[i], x)
  }
  return(x)
}

And now for a test:

a <- "Geidorf: GrabengÃ¼rtel"

utf2win(a)

And nothing happens... I guess the issue is that the character "Ã" is not recognized properly. Do you have a solution for my problem?


asked Jan 08 '16 15:01

Seb

1 Answer

This is an encoding problem. You may be able to fix it, but it's hard to know without the file. readBin is a good bet if you can't force the proper encoding. Here is a summary of what I found:
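
If the file on disk is actually valid UTF-8 and was only read with the wrong encoding, declaring the encoding at read time may make the repair unnecessary. A minimal sketch, using a throwaway temp file with made-up content in place of the real file (the path and the text are assumptions):

```r
# Write a small UTF-8 file to stand in for the real one (content is made up)
path <- tempfile(fileext = ".txt")
writeBin(charToRaw(enc2utf8("Geidorf: Grabeng\u00fcrtel\n")), path)

# Mark the lines as UTF-8 while reading; no bytes are changed
lines <- readLines(path, encoding = "UTF-8")
lines

# If the encoding is unknown, look at the raw bytes instead
bytes <- readBin(path, what = "raw", n = file.size(path))
bytes
```

In the raw output, "ü" should show up as the byte pair c3 bc if the file is UTF-8. A single fc byte would point to Windows-1252 or Latin-1 instead.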

I tried iconv on the example string:

iconv(a, "UTF-8", "WINDOWS-1252")
#[1] "Geidorf: Grabengürtel"

And it works, but you are right that something is up with "Ã":

iconv("Geidorf: GrabengÃ¼rtel Ã", "UTF-8", "WINDOWS-1252")
#[1] NA
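
The difference between the two calls is easier to see at the byte level. This is only a sketch: the mojibake is written with `\u` escapes so the script's own encoding cannot interfere, and the idea that the lone "Ã" hides a C1 control such as U+0081 behind it is my assumption, since the invisible character cannot be read off the page.

```r
# Mojibake for "ü": "Ã" (U+00C3) followed by "¼" (U+00BC)
broken <- "Grabeng\u00c3\u00bcrtel"

# Encoding those two characters as Windows-1252 gives bytes c3 and bc,
# which together are "ü" in UTF-8
fixed <- iconv(broken, from = "UTF-8", to = "WINDOWS-1252")
charToRaw(fixed)

# Declare the result as UTF-8 so it prints the same in any locale
Encoding(fixed) <- "UTF-8"
fixed

# Assumption: the lone "Ã" is really "Ã" plus U+0081.
# Windows-1252 has no byte for U+0081, so on many platforms this returns NA;
# the exact behaviour depends on the platform's iconv.
iconv("\u00c3\u0081", from = "UTF-8", to = "WINDOWS-1252")
```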

We can see which letters are problematic:

ist[is.na(iconv(ist, "UTF-8", "WINDOWS-1252"))]
#[1] "Ã" "Ã" "Ã" "Ã" "Ã" "Ã"

# corresponding characters
paste(soll[is.na(iconv(ist, "UTF-8", "WINDOWS-1252"))])
#[1] "Á" "Í" "Ï" "Ð" "Ý" "à"

The site you linked to has a relevant page, which spells out what the issue is:

Encoding Problem: Double Mis-Conversion

Symptom

With this particular double conversion, most characters display correctly. Only characters with a second UTF-8 byte of 0x81, 0x8D, 0x8F, 0x90, 0x9D fail. In Windows-1252, the following characters with the Unicode code points: U+00C1, U+00CD, U+00CF, U+00D0, and U+00DD will show the problem. If you look at the I18nQA Encoding Debug Table you can see that these characters in UTF-8 have second bytes ending in one of the Unassigned Windows code points.

Á Í Ï Ð Ý
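
The byte values in the quote can be checked directly in R. A small sketch (the variable names are mine) that takes the five code points and extracts the second UTF-8 byte of each:

```r
# The five characters named in the quote, by code point
chars <- c("\u00c1", "\u00cd", "\u00cf", "\u00d0", "\u00dd")

# Each one is two bytes in UTF-8: c3 followed by the byte extracted here
second <- vapply(chars, function(ch) charToRaw(enc2utf8(ch))[2], raw(1))
unname(second)
```

These are exactly the bytes that Windows-1252 leaves unassigned, which is why the second half of each mojibake pair has no visible character.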

"à" is a different case. You have mapped it to "Ã" when it should be "Ã\u00A0" or "Ã\xA0" or "Ã " (note that the space is not a normal space; it's a non-breaking space). So, fixing that in ist takes care of one letter.

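A sketch of the corrected entry for "à", written with escapes so the non-breaking space cannot get lost in copy and paste. The names `ist_a_grave` and `soll_a_grave` and the test string are made up for illustration:

```r
# Mojibake for "à": "Ã" (U+00C3) followed by a non-breaking space (U+00A0)
ist_a_grave  <- "\u00c3\u00a0"
soll_a_grave <- "\u00e0"

# fixed = TRUE treats the pattern as a literal string, not a regular expression
x <- "voil\u00c3\u00a0"
res <- gsub(ist_a_grave, soll_a_grave, x, fixed = TRUE)
res
```
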
As for the other characters ("Á", "Í", "Ï", "Ð", and "Ý"), they are all mapped to the same string, "Ã", in ist as written. gsub cannot tell them apart, so you'll never get the appropriate substitutions as long as that's true.
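
Whether these five can be recovered depends on what is left in the file. The sketch below assumes, and this is only an assumption, that the unassigned bytes survived as the C1 control characters U+0081, U+008D, U+008F, U+0090 and U+009D instead of being dropped. If they were dropped, the information is gone and no table can bring it back. The names `soll5`, `ist5` and `fix5` are hypothetical.

```r
# Assumption: the invisible second characters are still in the data as C1 controls
soll5 <- c("\u00c1", "\u00cd", "\u00cf", "\u00d0", "\u00dd")
ist5  <- paste0("\u00c3", c("\u0081", "\u008d", "\u008f", "\u0090", "\u009d"))

# With the second character included, every pattern is distinct
anyDuplicated(ist5)

fix5 <- function(x) {
  for (i in seq_along(ist5)) {
    # Literal replacement, one mojibake pair at a time
    x <- gsub(ist5[i], soll5[i], x, fixed = TRUE)
  }
  x
}

fix5("\u00c3\u0081rea")
```

These replacements would have to run before any rule that matches a bare "Ã", otherwise that rule consumes the first half of each pair.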


answered Oct 09 '22 01:10

Jota
