When reading the text below, fread()
fails to detect the missing values in columns 8 and 9. This is only with the default option integer64="integer64"
. Setting integer64="double"
or "character"
correctly detects NA
s. Note that the file has three types of possible NAs in V8 and V9-- ,,
; , ,
; and NA
. Appending na.strings=c("NA","N/A",""," "), sep=","
as options has no effect.
Using read.csv()
works the same way as fread(integer="double")
.
Text to be read (also available as a file integer64_and_NA.csv):
2012,276,,0,"S1","001",1,,724135215,1590915056,
2012,276,2,8,"S1","001",1, ,,154598,0
2012,276,2,12,"S1","001",1,NA,5118863,21819477,
2012,276,2,0,"S1","011",8,3127133583,3127133583,9003982501,0
Here's the output from fread()
:
DT <- fread(input="integer64_and_NA.csv", verbose=TRUE, integer64="integer64", na.strings=c("NA","N/A",""," "), sep=",")
Input contains no \n. Taking this to be a filename to open
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 4 (the last non blank line in the first 'autostart') ... found ok
Found 11 columns
First row with 11 fields occurs on line 1 (either column names or first row of data)
Some fields on line 1 are not type character (or are empty). Treating as a data row and using default column names.
Count of eol after first data row: 5
Subtracted 1 for last eol and any trailing empty lines, leaving 4 data rows
Type codes: 11114412221 (first 5 rows)
Type codes: 11114412221 (after applying colClasses and integer64)
Type codes: 11114412221 (after applying drop or select (if supplied)
Allocating 11 column slots (11 - 0 NULL)
0.000s ( 0%) Memory map (rerun may be quicker)
0.000s ( 0%) sep and header detection
0.000s ( 0%) Count rows (wc -l)
0.000s ( 0%) Column type detection (first, middle and last 5 rows)
0.000s ( 0%) Allocation of 4x11 result (xMB) in RAM
0.000s ( 0%) Reading data
0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered
0.000s ( 0%) Coercing data already read in type bumps (if any)
0.000s ( 0%) Changing na.strings to NA
0.001s Total
The resulting data.table is:
DT
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
1: 2012 276 NA 0 S1 001 1 9218868437227407266 724135215 1590915056 NA
2: 2012 276 2 8 S1 001 1 9218868437227407266 9218868437227407266 154598 0
3: 2012 276 2 12 S1 001 1 9218868437227407266 5118863 21819477 NA
4: 2012 276 2 0 S1 011 8 3127133583 3127133583 9003982501 0
NA
values are properly detected in columns which are not integer64
. For V8 and V9, which fread()
marks as integer64, instead of NAs we have "9218868437227407266".
Interestingly enough, str()
returns the respective values of V8 and V9 as NA
:
str(DT)
Classes ‘data.table’ and 'data.frame': 4 obs. of 11 variables:
$ V1 : int 2012 2012 2012 2012
$ V2 : int 276 276 276 276
$ V3 : int NA 2 2 2
$ V4 : int 0 8 12 0
$ V5 : chr "S1" "S1" "S1" "S1"
$ V6 : chr "001" "001" "001" "011"
$ V7 : int 1 1 1 8
$ V8 :Class 'integer64' num [1:4] NA NA NA 1.55e-314
$ V9 :Class 'integer64' num [1:4] 3.58e-315 NA 2.53e-317 1.55e-314
$ V10:Class 'integer64' num [1:4] 7.86e-315 7.64e-319 1.08e-316 4.45e-314
$ V11: int NA 0 NA 0
- attr(*, ".internal.selfref")=<externalptr>
... but nothing else sees them as NA
:
is.na(DT$V8)
[1] FALSE FALSE FALSE FALSE
max(DT$V8)
integer64
[1] 9218868437227407266
> max(DT$V8, na.rm=TRUE)
integer64
[1] 9218868437227407266
> class(DT$V8)
[1] "integer64"
> typeof(DT$V8)
[1] "double"
It does not seem to be a print/screen issue only, data.table
sees them as huge integers:
DT[, V12:=as.numeric(V8)]
Warning message:
In as.double.integer64(V8) :
integer precision lost while converting to double
> DT
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1: 2012 276 NA 0 S1 001 1 9218868437227407266 724135215 1590915056 NA 9.218868e+18
2: 2012 276 2 8 S1 001 1 9218868437227407266 9218868437227407266 154598 0 9.218868e+18
3: 2012 276 2 12 S1 001 1 9218868437227407266 5118863 21819477 NA 9.218868e+18
4: 2012 276 2 0 S1 011 8 3127133583 3127133583 9003982501 0 3.127134e+09
Am I missing something about integer64
, or is this a bug? As said above, I can get around using integer64="double"
, possibly losing some precision, as mentioned in the help file. But the unexpected behavior is with the default integer64
...
This was done on a Windows 8.1 64-bit machine running Revolution R 3.0.2, and also on a virtual machine running kubuntu 13.10, CRAN-R 3.0.2. Tested with the latest stable data.table from CRAN (1.8.10 as of 7 Feb 2014) and 1.8.11 (rev. 1110, 2014-02-04 02:43:19, manually installed from the zip as the r-forge build is broken) on Windows, and only the stable 1.8.10 on linux. bit64 is installed and loaded on both machines.
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] bit64_0.9-3 bit_1.1-11 gdata_2.13.2 xts_0.9-7 zoo_1.7-10 nlme_3.1-113 hexbin_1.26.3 lattice_0.20-24 ggplot2_0.9.3.1
[10] plyr_1.8 reshape2_1.2.2 data.table_1.8.11 Revobase_7.0.0 RevoMods_7.0.0 RevoScaleR_7.0.0
loaded via a namespace (and not attached):
[1] codetools_0.2-8 colorspace_1.2-4 dichromat_2.0-0 digest_0.6.4 foreach_1.4.1 gtable_0.1.2 gtools_3.2.1 iterators_1.0.6
[9] labeling_0.2 MASS_7.3-29 munsell_0.4.2 proto_0.3-10 RColorBrewer_1.0-5 reshape_0.8.4 scales_0.2.3 stringr_0.6.2
[17] tools_3.0.2
This bug, #488, is now fixed with this commit in development version of data.table v1.9.5
, and values are assigned (and displayed) properly as NA
if bit64
is loaded.
require(data.table) # v1.9.5
require(bit64)
ans = fread("test.csv")
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
# 1: 2012 276 NA 0 S1 001 1 NA 724135215 1590915056 NA
# 2: 2012 276 2 8 S1 001 1 NA NA 154598 0
# 3: 2012 276 2 12 S1 001 1 NA 5118863 21819477 NA
# 4: 2012 276 2 0 S1 011 8 3127133583 3127133583 9003982501 0
This apparently is an issue with the bit64 package, not fread()
or data.table
. From the bit64
documentation http://cran.r-project.org/web/packages/bit64/bit64.pdf
"Subscripting non-existing elements and subscripting with NAs is currently not supported. Such subscripting currently returns 9218868437227407266 instead of NA (the NA value of the un-derlying double code). Following the full R behaviour here would either destroy performance or require extensive C-coding."
I tried reassigning the 9218868437227407266 value to NA thinking it would work
Ex.
DT[V8==9218868437227407266, ]
#actually returns nothing, but
DT[V8==max(V8), ]
#returns the rows with 9218868437227407266 in V8
#but this does not reassign the value
DT[V8==max(V8), V8:=NA]
#not that this makes sense, but I tried just in case...
DT[V8==max(V8), V8:=NA_character_]
So as the documentation pretty clearly states, if a vector is class integer64 it won't recognize NA or missing values. I've going to avoid bit64 just to not have to deal with this...
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With