I have the following file read into a data.table like so:
raw <- fread("avito_train.tsv", nrows=1000)
Then, if I change the encoding of a particular column and row like this:
Encoding(raw$title[2]) <- "UTF-8"
It works perfectly.
But, how can I apply the encoding to all columns, and all rows?
I checked the fread documentation but there doesn't appear to be any encoding option. Also, I tried Encoding(raw) but that gives me an error (a character vector argument expected).
Edit: This article details more information on foreign text in RStudio on Windows http://quantifyingmemory.blogspot.com/2013/01/r-and-foreign-characters.html
This has been recently implemented in the devel version of data.table, v1.9.5. This'll be soon pushed to CRAN (as v1.9.6). Could you please give the devel version a try to see if that solves this for you?
fread() has gained an encoding argument, specifically for issues with Windows.
require(data.table) # v1.9.5+
fread("file.txt", encoding="UTF-8")
should solve the issue. There's no file for me to test. If it doesn't solve your issue, please file an issue on the project page, with a reproducible example/file.
I tried this:
Encoding(raw$title) <- "UTF-8"
Which sets the encoding for the entire column. That will work fine for now. Still open to any other options so it will do this automatically upon import.
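To illustrate why the single assignment above covers the whole column, here is a minimal base R sketch. The `titles` vector is a made-up stand-in for `raw$title`, built from raw UTF-8 bytes so that it starts without an encoding mark. It also shows that `Encoding<-` only declares how the existing bytes should be read and does not convert them.

```r
# UTF-8 bytes of a short Cyrillic word; rawToChar() yields an unmarked string.
bytes <- as.raw(c(0xd0, 0xbf, 0xd1, 0x80, 0xd0, 0xb8,
                  0xd0, 0xb2, 0xd0, 0xb5, 0xd1, 0x82))

# Hypothetical stand-in for raw$title: one non-ASCII and one ASCII element.
titles <- c(rawToChar(bytes), "plain ascii")
before <- Encoding(titles)

# One assignment marks every element of the vector; no per-row loop is needed.
Encoding(titles) <- "UTF-8"
after <- Encoding(titles)

# The underlying bytes are untouched: this is a declaration, not a conversion.
same_bytes <- identical(charToRaw(titles[1]), bytes)

print(before)
print(after)
print(same_bytes)
```

Pure ASCII strings stay marked as "unknown", which is harmless. If the file were actually stored in a different encoding (for example a Windows code page), marking it as UTF-8 would not be enough and something like `iconv()` would be needed instead.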
Sadly, there does not seem to be a way of doing this while importing (yet) with fread.
While you seem to have figured it out already, I'll post a way of setting the encoding of the entire dt after import.
One way of getting it done would be to loop that over all the character columns in a data table:
for (name in colnames(raw[, sapply(raw, is.character), with=FALSE])) {
  Encoding(raw[[name]]) <- "UTF-8"
}
The colnames(...) part first selects the character columns (with=FALSE tells data.table to treat the j argument as a column selector rather than as an expression), and then takes their names for the loop to iterate over. In short: this does what you have already found to work, but across all character columns.
Now, since the column names themselves (including those of your integer, float, etc. columns) may also contain characters that need the encoding set, the following should solve it:
# Encoding<- is vectorized over all the names, so no loop is needed
Encoding(colnames(raw)) <- "UTF-8"
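The two steps can also be wrapped into one function. This is only a sketch: `mark_utf8` is a hypothetical helper name, not part of data.table, and `demo` is a small made-up data.frame standing in for `raw`. It assumes the text really is UTF-8 and only lacks the mark.

```r
# Hypothetical helper: mark all character columns and the column names as UTF-8.
mark_utf8 <- function(df) {
  mark <- function(x) {
    Encoding(x) <- "UTF-8"
    x
  }
  # Find the character columns; other column types are left untouched.
  is_chr <- vapply(df, is.character, logical(1))
  for (name in names(df)[is_chr]) {
    df[[name]] <- mark(df[[name]])
  }
  # Column names are a character vector too, so mark them in one go.
  names(df) <- mark(names(df))
  df
}

# Stand-in data: the first title is built from raw UTF-8 bytes, so it starts unmarked.
bytes <- as.raw(c(0xd0, 0xbf, 0xd1, 0x80, 0xd0, 0xb8,
                  0xd0, 0xb2, 0xd0, 0xb5, 0xd1, 0x82))
demo <- data.frame(id = 1:2,
                   title = c(rawToChar(bytes), "plain ascii"),
                   stringsAsFactors = FALSE)

demo <- mark_utf8(demo)
print(Encoding(demo$title))
```

Calling `raw <- mark_utf8(raw)` on the data.table should behave the same way, since it uses the same `[[` assignment as the loop above. For large tables, data.table's `set()` could be used inside the loop instead to update the columns by reference.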