
Warning message: in rbindlist(allargs) : NAs introduced by coercion: possible bug in data.table?

Tags: r, data.table

While analysing some data, I came across the warning message below. I suspect it is a bug, since the command is straightforward and one I have used many times.

Warning message:
In rbindlist(allargs) : NAs introduced by coercion

I was able to reproduce the warning. The following code should reproduce it for you as well.

library(data.table)

# random 10-letter names for column V1
set.seed(45)
n <- sapply(1:500, function(x) {
    paste(sample(letters, 10), collapse="")
})
# generate some values for V2 and V3
dt <- data.table(V1 = sample(n, 30*500, replace = TRUE), 
                 V2 = sample(1:10, 30*500, replace = TRUE), 
                 V3 = sample(50:100, 30*500, replace = TRUE))
setkey(dt, "V1")

# No warning when providing column names (and right results)
dt[, list(s = sum(V2), m = mean(V3)),by=V1]

#              V1   s        m
#   1: acgmqyuwpe 238 74.97778
#   2: adcltygwsq 204 79.94118
#   3: adftozibnh 165 75.51515
#   4: aeuowtlskr 164 75.70968
#   5: ahfoqclkpg 192 73.20000
#  ---                        
# 496: zuqegoxkpi  93 77.95000
# 497: zwpserimgf 178 72.62963
# 498: zxkpdrlcsf 154 78.04167
# 499: zxvoaeflhq 121 75.34615
# 500: zyiwcsanlm 180 76.61290

# Warning message and results with NA
dt[, list(sum(V2), mean(V3)),by=V1]

#              V1  V1       V2
#   1: acgmqyuwpe 238 74.97778
#   2: adcltygwsq 204 79.94118
#   3: adftozibnh 165 75.51515
#   4: aeuowtlskr 164 75.70968
#   5: ahfoqclkpg 192 73.20000
#  ---                        
# 496: zuqegoxkpi  NA 77.95000
# 497: zwpserimgf  NA 72.62963
# 498: zxkpdrlcsf  NA 78.04167
# 499: zxvoaeflhq  NA 75.34615
# 500: zyiwcsanlm  NA 76.61290

Warning message:
In rbindlist(allargs) : NAs introduced by coercion

  • 1) This seems to happen only if you don't provide column names in j.

  • 2) Even then, it seems to happen only when V1 (or whichever column you use in by=) has a lot of unique entries (500 here). It DOES NOT happen when the by= column has fewer unique entries. For example, change the code for n from sapply(1:500, ... to sapply(1:50, ... and you'll get no warning.

What's going on here? This is R version 2.15.0 on a MacBook Pro with OS X 10.8.2 (I also tested it on another MacBook Pro with R 2.15.2). Here's the sessionInfo().

> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.8.6 reshape2_1.2.2  

loaded via a namespace (and not attached):
[1] plyr_1.8      stringr_0.6.2 tools_2.15.0 

Just reproduced with 2.15.2:

> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.8.6
Asked by Arun on Jan 29 '13 at 13:01



1 Answer

UPDATE: Now fixed in v1.8.9 by Ricardo

o rbind'ing data.tables containing duplicate, "" or NA column names now works, #2726 & #2384. Thanks to Garrett See and Arun Srinivasan for reporting. This also affected the printing of data.tables with duplicate column names since the head and tail are rbind-ed together internally.


Yes, it's a bug. It seems to be in the print method for data.tables with duplicated column names.

ans = dt[, list(sum(V2), mean(V3)),by=V1]
head(ans)
           V1  V1       V2     # notice the duplicated V1
1: acgmqyuwpe 140 78.07692
2: adcltygwsq 191 76.93333
3: adftozibnh 153 73.82143
4: aeuowtlskr 122 73.04348
5: ahfoqclkpg 143 75.83333
6: ahtczyuipw 135 73.54167
tail(ans)
           V1  V1       V2
1: zugrnehpmq 189 72.63889
2: zuqegoxkpi 150 76.03333
3: zwpserimgf 180 74.81818
4: zxkpdrlcsf 115 72.57895
5: zxvoaeflhq 157 76.53571
6: zyiwcsanlm 145 72.79167
print(ans)
Error in rbindlist(allargs) : 
    (converted from warning) NAs introduced by coercion
rbind(head(ans),tail(ans))
Error in rbindlist(allargs) : 
    (converted from warning) NAs introduced by coercion
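To confirm the diagnosis without triggering the failing print, you can inspect the column names of the result directly. The following is only a sketch on a smaller made-up table (dt2 and ans2 are hypothetical names, not from the question). It also hints at why the number of groups matters: printing only rbinds head and tail for long results (more than 100 rows under the default datatable.print.nrows setting, as far as I know), which is likely why 50 groups showed no warning.

```r
library(data.table)

# Hypothetical small table using the same V1..V3 column names as the question
set.seed(1)
dt2 <- data.table(V1 = sample(letters, 300, replace = TRUE),
                  V2 = sample(1:10, 300, replace = TRUE),
                  V3 = sample(50:100, 300, replace = TRUE))

# Unnamed expressions in j: the result columns get automatic names
ans2 <- dt2[, list(sum(V2), mean(V3)), by = V1]

# Look at the names instead of printing the whole table
result_names <- names(ans2)
result_names

# Non-zero when at least one column name occurs more than once
anyDuplicated(result_names)

# Short results are printed whole, so no head/tail rbind is needed
nrow(ans2)
```

If anyDuplicated() returns a non-zero value, the automatic names have collided with the by= column.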

As a workaround, don't create data.tables with column names V1, V2, etc., or name the result columns explicitly in j, as in the first call in the question.
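A minimal sketch of that workaround, assuming a small made-up table (dt3, named and renamed are hypothetical names used only for illustration). It shows two ways to end up with distinct column names: naming the expressions in j, or renaming by position afterwards with setnames().

```r
library(data.table)

# Hypothetical table with the same V1..V3 column names as the question
set.seed(45)
dt3 <- data.table(V1 = sample(letters, 300, replace = TRUE),
                  V2 = sample(1:10, 300, replace = TRUE),
                  V3 = sample(50:100, 300, replace = TRUE))

# Option 1: name the result columns in j
named <- dt3[, list(s = sum(V2), m = mean(V3)), by = V1]

# Option 2: aggregate without names, then rename columns 2 and 3 by position;
# positions stay unambiguous even if the automatic names are duplicated
renamed <- dt3[, list(sum(V2), mean(V3)), by = V1]
setnames(renamed, 2:3, c("s", "m"))

names(named)
names(renamed)
```

Renaming by position rather than by old name is a deliberate choice here, since matching on a duplicated name would be ambiguous.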

It arises from this known bug:

#2384 rbind of tables containing duplicate column names doesn't bind correctly

and I've added a link there back to this question.

Thanks!

Answered by Matt Dowle on Sep 23 '22 at 12:09