When I call merge from data.table on two tables, I get this encoding warning:
Please ensure that character columns have identical encodings for joins.
How can I tell data.table which encoding is used? I know I can silence the warning with suppressWarnings, but I would prefer to fix this in a clean way.
This reproduces the warning:
library(data.table)
options(stringsAsFactors=FALSE)
dt = data.table(text=c('é','à','s'),
title='agstudy',hrefs='a')
setkeyv(dt,names(dt))
dt.new = data.table(text=c('é','à','h','a'),
hrefs=c(rep('a',2),rep('aa',2)),
title=c(rep('agstudy',2),rep('new',2)))
setkeyv(dt.new,names(dt.new))
merge(dt.new,dt,all=TRUE)
Warning messages:
1: In `[.data.table`(y, xkey, nomatch = ifelse(all.x, NA, 0), allow.cartesian = allow.cartesian) :
Encoding of character column 'text' in X is different from column 'text' in Y
in join X[Y]. Joins are not implemented yet for non-identical character encodings
and therefore likely to contain unexpected results for those entries.
Please ensure that character columns have identical encodings for joins.
EDIT: some session information:
sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)
[1] data.table_1.8.11
EDIT 2: some context
My data.table is created after some scraping, where I set the encoding to UTF-8 using htmlParse(...,encoding='UTF-8'). I then create the data.table from the scraped text.
The warning results from a mixture of encodings in your character vectors. Pure ASCII strings are reported with encoding "unknown", while the strings containing accented characters are probably marked "latin1".
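To see the mixture described above, you can inspect the declared encodings with base R's Encoding(). This is only a small sketch: the vectors x and y are made up for illustration, and the latin1 copy is simulated with iconv() rather than taken from the scraped data in the question.

```r
# Strings written with \u escapes are marked as UTF-8 by R;
# pure ASCII strings are always reported as "unknown".
x <- c("\u00e9", "\u00e0", "s")
Encoding(x)

# Simulate the same text arriving in latin1 (assumption: iconv()
# supports this conversion on your platform).
y <- iconv(x, from = "UTF-8", to = "latin1")
Encoding(y)

# Same characters, different declared encodings: this is the kind of
# mismatch the join warning complains about.
identical(Encoding(x), Encoding(y))
```

Running Encoding() on the join columns of both tables before merging shows quickly whether they disagree.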
Use this to set the declared encoding of every character column to unknown:
dt[, names(dt) := lapply(.SD, function(x) {if (is.character(x)) Encoding(x) <- "unknown"; x})]
If you do the same for the second data.table (dt.new), you avoid the warning.
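Note that Encoding(x) <- "unknown" only removes the encoding mark and leaves the bytes untouched. A different approach, not the one shown above, is to actually re-encode both tables to a common encoding with base R's enc2utf8(). The sketch below assumes that is acceptable for your data; to_utf8 is a hypothetical helper name, not a data.table function, and the latin1 input is simulated with iconv().

```r
library(data.table)

# Hypothetical helper (not part of data.table): re-encode every
# character column of a data.table to UTF-8, by reference.
to_utf8 <- function(DT) {
  chr_cols <- names(DT)[vapply(DT, is.character, logical(1))]
  for (col in chr_cols) {
    # enc2utf8() converts the bytes and marks non-ASCII strings as UTF-8
    set(DT, j = col, value = enc2utf8(DT[[col]]))
  }
  invisible(DT)
}

# Simulated latin1 input (assumption: iconv() supports this conversion)
dt_latin1 <- data.table(
  text  = iconv(c("\u00e9", "\u00e0", "h"), from = "UTF-8", to = "latin1"),
  title = "agstudy"
)

to_utf8(dt_latin1)
Encoding(dt_latin1$text)
```

Applying the same helper to both tables before the merge should leave the join columns with matching encodings. Because set() replaces the columns, any key on them is dropped and has to be set again afterwards.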
Note that you are using a development version. The behaviour could change soon.
The encoding issues are fixed in v1.9.7 (current devel). See ReleaseNotes, Bug Fixes #23. This should work as intended without any warnings or need for conversion of encodings. Please report back if it doesn't.
require(data.table) # v1.9.7+
dt = data.table(text=c('é','à','s'), title='agstudy',hrefs='a')
dt.new = data.table(text=c('é','à','h','a'), hrefs=c(rep('a',2),rep('aa',2)), title=c(rep('agstudy',2),rep('new',2)))
merge(dt.new, dt, all=TRUE)
#    text hrefs   title
# 1:    a    aa     new
# 2:    h    aa     new
# 3:    s     a agstudy
# 4:    à     a agstudy
# 5:    é     a agstudy
merge(dt.new, dt, all=TRUE, by=c("text", "title"))
#    text   title hrefs.x hrefs.y
# 1:    a     new      aa      NA
# 2:    h     new      aa      NA
# 3:    s agstudy      NA       a
# 4:    à agstudy       a       a
# 5:    é agstudy       a       a
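Since the behaviour described here depends on the installed version, it may help to check it before deciding whether the encoding workaround is still needed. A minimal sketch using base R's packageVersion(); the variable name has_fix is only illustrative.

```r
# Compare the installed data.table version against the release that
# is said above to contain the encoding fix.
has_fix <- packageVersion("data.table") >= "1.9.7"
has_fix
```

If this returns FALSE, the warning from the question can still appear and the encodings have to be aligned manually.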