Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

fread() fails with missing values in integer64 columns

Tags:

r

data.table

When reading the text below, fread() fails to detect the missing values in columns 8 and 9. This is only with the default option integer64="integer64". Setting integer64="double" or "character" correctly detects NAs. Note that the file has three types of possible NAs in V8 and V9-- ,,; , ,; and NA. Appending na.strings=c("NA","N/A",""," "), sep="," as options has no effect.

Using read.csv() works the same way as fread(integer="double").

Text to be read (also available as a file integer64_and_NA.csv):

2012,276,,0,"S1","001",1,,724135215,1590915056,
2012,276,2,8,"S1","001",1, ,,154598,0
2012,276,2,12,"S1","001",1,NA,5118863,21819477,
2012,276,2,0,"S1","011",8,3127133583,3127133583,9003982501,0

Here's the output from fread():

DT <- fread(input="integer64_and_NA.csv", verbose=TRUE, integer64="integer64", na.strings=c("NA","N/A",""," "), sep=",")

Input contains no \n. Taking this to be a filename to open
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 4 (the last non blank line in the first 'autostart') ... found ok
Found 11 columns
First row with 11 fields occurs on line 1 (either column names or first row of data)
Some fields on line 1 are not type character (or are empty). Treating as a data row and using default column names.
Count of eol after first data row: 5
Subtracted 1 for last eol and any trailing empty lines, leaving 4 data rows
Type codes: 11114412221 (first 5 rows)
Type codes: 11114412221 (after applying colClasses and integer64)
Type codes: 11114412221 (after applying drop or select (if supplied)
Allocating 11 column slots (11 - 0 NULL)
   0.000s (  0%) Memory map (rerun may be quicker)
   0.000s (  0%) sep and header detection
   0.000s (  0%) Count rows (wc -l)
   0.000s (  0%) Column type detection (first, middle and last 5 rows)
   0.000s (  0%) Allocation of 4x11 result (xMB) in RAM
   0.000s (  0%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.000s (  0%) Changing na.strings to NA
   0.001s        Total

The resulting data.table is:

DT
     V1  V2 V3 V4 V5  V6 V7                  V8                  V9        V10 V11
1: 2012 276 NA  0 S1 001  1 9218868437227407266           724135215 1590915056  NA
2: 2012 276  2  8 S1 001  1 9218868437227407266 9218868437227407266     154598   0
3: 2012 276  2 12 S1 001  1 9218868437227407266             5118863   21819477  NA
4: 2012 276  2  0 S1 011  8          3127133583          3127133583 9003982501   0

NA values are properly detected in columns which are not integer64. For V8 and V9, which fread() marks as integer64, instead of NAs we have "9218868437227407266". Interestingly enough, str() returns the respective values of V8 and V9 as NA:

str(DT)

Classes ‘data.table’ and 'data.frame':  4 obs. of  11 variables:
 $ V1 : int  2012 2012 2012 2012
 $ V2 : int  276 276 276 276
 $ V3 : int  NA 2 2 2
 $ V4 : int  0 8 12 0
 $ V5 : chr  "S1" "S1" "S1" "S1"
 $ V6 : chr  "001" "001" "001" "011"
 $ V7 : int  1 1 1 8
 $ V8 :Class 'integer64'  num [1:4] NA NA NA 1.55e-314
 $ V9 :Class 'integer64'  num [1:4] 3.58e-315 NA 2.53e-317 1.55e-314
 $ V10:Class 'integer64'  num [1:4] 7.86e-315 7.64e-319 1.08e-316 4.45e-314
 $ V11: int  NA 0 NA 0
 - attr(*, ".internal.selfref")=<externalptr> 

... but nothing else sees them as NA:

is.na(DT$V8)
[1] FALSE FALSE FALSE FALSE
max(DT$V8)
integer64
[1] 9218868437227407266
> max(DT$V8, na.rm=TRUE)
integer64
[1] 9218868437227407266
> class(DT$V8)
[1] "integer64"
> typeof(DT$V8)
[1] "double"

It does not seem to be a print/screen issue only, data.table sees them as huge integers:

DT[, V12:=as.numeric(V8)]
Warning message:
In as.double.integer64(V8) :
  integer precision lost while converting to double
> DT
     V1  V2 V3 V4 V5  V6 V7                  V8                  V9        V10 V11          V12
1: 2012 276 NA  0 S1 001  1 9218868437227407266           724135215 1590915056  NA 9.218868e+18
2: 2012 276  2  8 S1 001  1 9218868437227407266 9218868437227407266     154598   0 9.218868e+18
3: 2012 276  2 12 S1 001  1 9218868437227407266             5118863   21819477  NA 9.218868e+18
4: 2012 276  2  0 S1 011  8          3127133583          3127133583 9003982501   0 3.127134e+09

Am I missing something about integer64, or is this a bug? As said above, I can get around using integer64="double", possibly losing some precision, as mentioned in the help file. But the unexpected behavior is with the default integer64...

This was done on a Windows 8.1 64-bit machine running Revolution R 3.0.2, and also on a virtual machine running kubuntu 13.10, CRAN-R 3.0.2. Tested with the latest stable data.table from CRAN (1.8.10 as of 7 Feb 2014) and 1.8.11 (rev. 1110, 2014-02-04 02:43:19, manually installed from the zip as the r-forge build is broken) on Windows, and only the stable 1.8.10 on linux. bit64 is installed and loaded on both machines.

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bit64_0.9-3       bit_1.1-11        gdata_2.13.2      xts_0.9-7         zoo_1.7-10        nlme_3.1-113      hexbin_1.26.3     lattice_0.20-24   ggplot2_0.9.3.1  
[10] plyr_1.8          reshape2_1.2.2    data.table_1.8.11 Revobase_7.0.0    RevoMods_7.0.0    RevoScaleR_7.0.0 

loaded via a namespace (and not attached):
 [1] codetools_0.2-8    colorspace_1.2-4   dichromat_2.0-0    digest_0.6.4       foreach_1.4.1      gtable_0.1.2       gtools_3.2.1       iterators_1.0.6   
 [9] labeling_0.2       MASS_7.3-29        munsell_0.4.2      proto_0.3-10       RColorBrewer_1.0-5 reshape_0.8.4      scales_0.2.3       stringr_0.6.2     
[17] tools_3.0.2      
like image 391
Peter Avatar asked Feb 07 '14 12:02

Peter


2 Answers

This bug, #488, is now fixed with this commit in development version of data.table v1.9.5, and values are assigned (and displayed) properly as NA if bit64 is loaded.

require(data.table) # v1.9.5
require(bit64)
ans = fread("test.csv")
#      V1  V2 V3 V4 V5  V6 V7         V8         V9        V10 V11
# 1: 2012 276 NA  0 S1 001  1         NA  724135215 1590915056  NA
# 2: 2012 276  2  8 S1 001  1         NA         NA     154598   0
# 3: 2012 276  2 12 S1 001  1         NA    5118863   21819477  NA
# 4: 2012 276  2  0 S1 011  8 3127133583 3127133583 9003982501   0
like image 107
Arun Avatar answered Nov 05 '22 15:11

Arun


This apparently is an issue with the bit64 package, not fread() or data.table. From the bit64 documentation http://cran.r-project.org/web/packages/bit64/bit64.pdf

"Subscripting non-existing elements and subscripting with NAs is currently not supported. Such subscripting currently returns 9218868437227407266 instead of NA (the NA value of the un-derlying double code). Following the full R behaviour here would either destroy performance or require extensive C-coding."

I tried reassigning the 9218868437227407266 value to NA thinking it would work

Ex.

DT[V8==9218868437227407266, ]
#actually returns nothing, but
DT[V8==max(V8), ]
#returns the rows with 9218868437227407266 in V8
#but this does not reassign the value 
DT[V8==max(V8), V8:=NA]
#not that this makes sense, but I tried just in case...
DT[V8==max(V8), V8:=NA_character_]

So as the documentation pretty clearly states, if a vector is class integer64 it won't recognize NA or missing values. I've going to avoid bit64 just to not have to deal with this...

like image 22
Alex Thomas Avatar answered Nov 05 '22 17:11

Alex Thomas