
data.table::merge: How to avoid encoding warnings with merge?

When I use data.table's merge, I get an encoding warning. My process is as follows:

  1. I create a first data.table.
  2. I update this data.table using merge.

But when I call merge, I get this warning:

Please ensure that character columns have identical encodings for joins.

How can I tell data.table which encoding is used? I know I can silence the warning with suppressWarnings, but I would prefer to fix this in a clean way.

This reproduces the warning:

library(data.table)
options(stringsAsFactors=FALSE)
dt = data.table(text=c('é','à','s'),
                title='agstudy',hrefs='a')
setkeyv(dt,names(dt))  
dt.new = data.table(text=c('é','à','h','a'),
                    hrefs=c(rep('a',2),rep('aa',2)),
                    title=c(rep('agstudy',2),rep('new',2)))
setkeyv(dt.new,names(dt.new))
merge(dt.new,dt,all=TRUE)

Warning messages:
1: In `[.data.table`(y, xkey, nomatch = ifelse(all.x, NA, 0), allow.cartesian = allow.cartesian) :
  Encoding of character column 'text' in X is different from column 'text' in Y 
  in join X[Y]. Joins are not implemented yet for non-identical character encodings 
  and therefore likely to contain unexpected results for those entries. 
  Please ensure that character columns have identical encodings for joins.

EDIT: some session information:

sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)
[1] data.table_1.8.11

EDIT2: some context

My data.table is built from scraped text. During scraping I set the encoding to UTF-8 using htmlParse(...,encoding='UTF-8'), and then I create the data.table from the scraped text.

agstudy asked Feb 27 '14 14:02


2 Answers

The warning results from a mixture of encodings in your character vectors. Strings that contain only ASCII characters have encoding "unknown", but the others are probably marked "latin1".
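To see which marks a character vector actually carries, base R's Encoding() can be called on it. The snippet below is only a sketch on made-up vectors (x and y are hypothetical, not the asker's data); it writes é as a Unicode escape so the result does not depend on the encoding of the script file.

```r
# Hypothetical example vectors, not taken from the question's data.
# "\u00e9" is 'é' written as a Unicode escape, which R marks as UTF-8.
x <- c("\u00e9", "s")
Encoding(x)   # pure-ASCII strings such as "s" are reported as "unknown"

# Re-encode to latin1; iconv() marks the non-ASCII result as "latin1".
y <- iconv(x, from = "UTF-8", to = "latin1")
Encoding(y)
```

Running Encoding(dt$text) and Encoding(dt.new$text) in the same way should show whether the two tables really disagree.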

Use this to convert all encodings to unknown:

dt[, names(dt) := lapply(.SD, function(x) {if (is.character(x)) Encoding(x) <- "unknown"; x})]

If you do the same for the second data.table (dt.new), you avoid the warning.
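Another option, instead of resetting the marks to "unknown", is to convert every character column to one common encoding with base R's enc2utf8(). This is a minimal sketch under assumptions: to_utf8 is a hypothetical helper (not part of data.table), ex is made-up example data, and whether this silences the warning on v1.8.11 has not been verified here.

```r
library(data.table)

# Hypothetical helper (not part of data.table): convert every character
# column of DT to UTF-8, modifying DT by reference.
to_utf8 <- function(DT) {
  chr <- names(DT)[vapply(DT, is.character, logical(1))]
  if (length(chr) > 0L) {
    DT[, (chr) := lapply(.SD, enc2utf8), .SDcols = chr]
  }
  invisible(DT)
}

# Made-up example: a latin1-marked string next to a plain ASCII one.
ex <- data.table(text  = c(iconv("\u00e9", "UTF-8", "latin1"), "s"),
                 title = "agstudy")
to_utf8(ex)
Encoding(ex$text)   # non-ASCII entries are now marked "UTF-8"
```

Assigning to key columns drops the key, so the conversion would need to run before setkeyv() on both tables.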

Note that you are using a development version. The behaviour could change soon.

Roland answered Nov 01 '22 03:11


The encoding issues are fixed in v1.9.7 (current devel). See ReleaseNotes, Bug Fixes #23. This should work as intended without any warnings or need for conversion of encodings. Please report back if it doesn't.

require(data.table) # v1.9.7+
dt = data.table(text=c('é','à','s'), title='agstudy',hrefs='a')
dt.new = data.table(text=c('é','à','h','a'), hrefs=c(rep('a',2),rep('aa',2)), title=c(rep('agstudy',2),rep('new',2)))

merge(dt.new, dt, all=TRUE)
#    text hrefs   title
# 1:    a    aa     new
# 2:    h    aa     new
# 3:    s     a agstudy
# 4:    à     a agstudy
# 5:    é     a agstudy

merge(dt.new, dt, all=TRUE, by=c("text", "title"))
#    text   title hrefs.x hrefs.y
# 1:    a     new      aa      NA
# 2:    h     new      aa      NA
# 3:    s agstudy      NA       a
# 4:    à agstudy       a       a
# 5:    é agstudy       a       a
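
Since the fix depends on the installed version, it may be worth checking it before dropping any encoding workaround. A small sketch using base R's packageVersion(); the variable name has_fix is an illustration only, and the 1.9.7 threshold is the one stated in this answer.

```r
library(data.table)

# Compare the installed version against the one named in this answer.
installed <- packageVersion("data.table")
has_fix <- installed >= "1.9.7"

if (!has_fix) {
  message("data.table ", as.character(installed),
          " predates v1.9.7; the encoding warning may still appear.")
}
```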

Arun answered Nov 01 '22 04:11