How to detect the right encoding for read.csv?

Tags:

r

character-encoding

read.csv

I have this file (http://b7hq6v.alterupload.com/en/) that I want to read in R with read.csv, but I am not able to detect the correct encoding. It seems to be a kind of UTF-8. I am using R 2.12.1 on a Windows XP machine. Any help?

698

asked Jan 26 '11 16:01

Alex

1 Answer

First of all, according to a more general question on StackOverflow, it is not possible to detect the encoding of a file with 100% certainty.

I've struggled with this many times and arrived at a non-automatic solution:

Use iconvlist to get all possible encodings:

codepages <- setNames(iconvlist(), iconvlist())

Then try to read the data using each of them:

x <- lapply(codepages, function(enc) try(read.table("encoding.asc",
                   fileEncoding=enc,
                   nrows=3, header=TRUE, sep="\t"))) # you get lots of errors/warnings here

It is important here to know the structure of the file (separator, headers). Set the encoding with the fileEncoding argument, and read only a few rows.
Now you can look at the results:

unique(do.call(rbind, sapply(x, dim)))
#        [,1] [,2]
# 437       14    2
# CP1200     3   29
# CP12000    0    1

It seems the correct result is the one with 3 rows and 29 columns, so let's see which encodings produce it:

maybe_ok <- sapply(x, function(x) isTRUE(all.equal(dim(x), c(3,29))))
codepages[maybe_ok]
#    CP1200    UCS-2LE     UTF-16   UTF-16LE      UTF16    UTF16LE
#  "CP1200"  "UCS-2LE"   "UTF-16" "UTF-16LE"    "UTF16"  "UTF16LE"
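Files in the UTF-16 family sometimes begin with a byte order mark (BOM), and checking for one can be a cheap way to corroborate a candidate like those above. The following is only a sketch: guess_bom is a hypothetical helper name, the temporary file is made up for illustration, and a file without a BOM simply yields NA, so this does not replace the trial-and-error approach.

```r
# Hypothetical helper: inspect the first bytes of a file for a known BOM.
guess_bom <- function(path) {
  b <- readBin(path, what = "raw", n = 4L)
  starts <- function(sig) {
    length(b) >= length(sig) && all(b[seq_along(sig)] == as.raw(sig))
  }
  # UTF-32LE must be tested before UTF-16LE because its BOM starts with FF FE.
  if (starts(c(0xEF, 0xBB, 0xBF))) {
    "UTF-8"
  } else if (starts(c(0xFF, 0xFE, 0x00, 0x00))) {
    "UTF-32LE"
  } else if (starts(c(0x00, 0x00, 0xFE, 0xFF))) {
    "UTF-32BE"
  } else if (starts(c(0xFF, 0xFE))) {
    "UTF-16LE"
  } else if (starts(c(0xFE, 0xFF))) {
    "UTF-16BE"
  } else {
    NA_character_  # no BOM found; the encoding is still unknown
  }
}

# Made-up sample: a UTF-16LE BOM followed by UTF-16LE encoded text.
tmp <- tempfile()
body <- iconv("id\tname\n", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
writeBin(c(as.raw(c(0xFF, 0xFE)), body), tmp)
guess_bom(tmp)
```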

You can also look at the data:

x[maybe_ok]

For your file, all of these encodings return identical data (partly because there is some redundancy among the names, as you can see).
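Since the original file is not reproduced here, the same workflow can be tried end to end on a small stand-in file. This is a sketch under assumptions: the file contents, the expected shape of 2 rows by 2 columns, and the short hand-picked candidate list (used instead of the full iconvlist()) are all made up for illustration.

```r
# Build a tiny tab-separated file encoded as UTF-16LE (made-up content).
txt   <- "id\tname\n1\tA\n2\tB\n"
bytes <- iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
tmp   <- tempfile(fileext = ".asc")
writeBin(bytes, tmp)

# Assumption: a short candidate list rather than everything in iconvlist().
candidates <- c("UTF-8", "latin1", "UTF-16LE", "UTF-16BE")
codepages  <- setNames(candidates, candidates)

# Try each encoding; failures come back as "try-error" objects.
x <- lapply(codepages, function(enc) {
  suppressWarnings(try(
    read.table(tmp, fileEncoding = enc, header = TRUE, sep = "\t",
               stringsAsFactors = FALSE),
    silent = TRUE
  ))
})

# Keep the encodings whose result has the expected shape.
maybe_ok <- sapply(x, function(d) isTRUE(all.equal(dim(d), c(2, 2))))
names(codepages)[maybe_ok]

# Inspect the surviving candidates before trusting any of them.
x[maybe_ok]
```

A wrong encoding can occasionally produce the expected shape as well, so looking at the actual values remains the deciding step.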

If you don't know the specifics of your file, you need to use readLines instead, with some changes to the workflow (e.g. you can't use fileEncoding, you must use length instead of dim, and you need to do more magic to find the correct ones).
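One possible reading of the readLines variant is sketched below. It is not the answer author's code: read_head, the candidate list, the sample file, and the decision to treat any warning (such as an embedded nul or an incomplete final line) as a sign of a wrong encoding are all assumptions made for illustration. The encoding is set on the connection because readLines has no fileEncoding argument.

```r
# Made-up sample file encoded as UTF-16LE.
txt <- "id\tname\n1\tA\n2\tB\n"
tmp <- tempfile(fileext = ".asc")
writeBin(iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], tmp)

# Hypothetical helper: read the first n lines with a given encoding.
# Returns NULL if opening or reading raises a warning or an error.
read_head <- function(path, enc, n = 3) {
  con <- tryCatch(file(path, open = "r", encoding = enc),
                  warning = function(w) NULL, error = function(e) NULL)
  if (is.null(con)) return(NULL)
  on.exit(close(con))
  tryCatch(readLines(con, n = n),
           warning = function(w) NULL, error = function(e) NULL)
}

# Assumption: a short candidate list rather than everything in iconvlist().
candidates <- c("UTF-8", "latin1", "UTF-16LE", "UTF-16BE")
heads <- lapply(setNames(candidates, candidates), read_head, path = tmp)

# Heuristic: the expected number of lines was read and none of them is empty.
ok <- sapply(heads, function(l) length(l) == 3 && all(nzchar(l)))
names(ok)[ok]
heads[ok]
```

The filter here is deliberately crude; for real data it would likely need extra checks, such as looking for a plausible separator in the lines that were read.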

175

answered Sep 21 '22 11:09

Marek
