
Apply Encoding to Entire Data.Table

I have the following file read into a data.table like so:

raw <- fread("avito_train.tsv", nrows=1000)

Then, if I change the encoding of a particular column and row like this:

Encoding(raw$title[2]) <- "UTF-8"

It works perfectly.

But, how can I apply the encoding to all columns, and all rows?

I checked the fread documentation but there doesn't appear to be any encoding option. Also, I tried Encoding(raw) but that gives me an error (a character vector argument expected).

Edit: This article has more information on handling foreign text in RStudio on Windows: http://quantifyingmemory.blogspot.com/2013/01/r-and-foreign-characters.html

user1477388 asked Jun 30 '14 14:06


3 Answers

This has recently been implemented in the devel version of data.table, v1.9.5, which will soon be pushed to CRAN (as v1.9.6). Could you please give the devel version a try to see if that solves this for you?

fread() has gained an encoding argument, specifically for issues on Windows.

require(data.table) # v1.9.5+
fread("file.txt", encoding="UTF-8")

This should solve the issue, although I have no file to test it with. If it doesn't, please file an issue on the project page with a reproducible example/file.
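To check that the argument has the intended effect, here is a minimal sketch. It assumes data.table v1.9.6 or later is installed; the temporary file and its contents are made up for illustration and stand in for avito_train.tsv. The non-ASCII bytes are written with escapes so the example does not depend on how this script itself is encoded.

```r
library(data.table)  # assumes v1.9.6+, where fread() has `encoding`

# Hypothetical two-row TSV containing UTF-8 bytes for "café"
tmp <- tempfile(fileext = ".tsv")
writeLines(c("id\ttitle", "1\tcaf\xc3\xa9", "2\tplain"), tmp, useBytes = TRUE)

# Ask fread() to mark the strings it reads as UTF-8
dt <- fread(tmp, sep = "\t", encoding = "UTF-8")

# Inspect the declared encoding of each element of the column
Encoding(dt$title)
```

Pure ASCII strings are never given a declared encoding, so only the elements that contain non-ASCII characters should report "UTF-8".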

Arun answered Oct 26 '22 18:10


I tried this:

Encoding(raw$title) <- "UTF-8"

This sets the encoding for the entire column, which works fine for now. I'm still open to other options that would do this automatically on import.
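The reason this works on a whole column is that the replacement form of Encoding() is vectorised over a character vector. A small base R sketch, using a made-up vector in place of raw$title:

```r
# Hypothetical stand-in for raw$title; the non-ASCII characters are
# given as UTF-8 byte escapes so the script's own encoding is irrelevant
titles <- c("caf\xc3\xa9", "plain ascii", "M\xc3\xbcnchen")

# Declare the encoding of every element in one call
Encoding(titles) <- "UTF-8"

Encoding(titles)
# [1] "UTF-8"   "unknown" "UTF-8"
```

Note that Encoding<- only declares how the existing bytes should be interpreted; it does not convert them (iconv() does that). ASCII-only strings always stay "unknown", which is harmless.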

user1477388 answered Oct 26 '22 19:10


Sadly, there does not (yet) seem to be a way of doing this with fread while importing.

While you seem to have figured it out already, I'll post a way of setting the encoding of the entire data.table after import.

One way of getting it done is to loop over all the character columns in the data.table:

for (name in colnames(raw[, sapply(raw, is.character), with=FALSE])) {
  Encoding(raw[[name]]) <- "UTF-8"
}

The colnames(...) part first selects the columns that are character (with=FALSE is needed so that data.table treats the logical vector as a column selector), and then takes the names of those columns to loop over. In short: this applies what you have already found to work, but across all character columns.
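As an alternative sketch of the same idea, the columns can be updated by reference with data.table's set(), which avoids copying the whole table on each assignment. The table below is a made-up stand-in for the one read with fread(), and char_cols is a name introduced here for illustration.

```r
library(data.table)

# Hypothetical stand-in for the imported table
raw <- data.table(
  id    = 1:2,
  title = c("caf\xc3\xa9", "plain"),
  body  = c("M\xc3\xbcnchen", "text")
)

# Names of the character columns only
char_cols <- names(raw)[vapply(raw, is.character, logical(1))]

for (col in char_cols) {
  x <- raw[[col]]
  Encoding(x) <- "UTF-8"         # mark the copy as UTF-8
  set(raw, j = col, value = x)   # put it back by reference
}

Encoding(raw$title)
```

Non-character columns such as id are skipped, since Encoding() only accepts character vectors.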

The column names themselves (including those of integer, numeric and other non-character columns) may also need their encoding set. No loop is needed for that, because the names form a single character vector:

Encoding(colnames(raw)) <- "UTF-8"

Patrik Bratkovič answered Oct 26 '22 18:10