Wrong columns' modes when reading data with 'na.strings' and 'colClasses' arguments of 'fread' function in R

Windows 8.1, R version 3.1.1 (2014-07-10), System x86_64, mingw32

I've got a file with a lot of observations. Here are some lines from the file:

Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3
16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000
16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000
28/4/2007;00:20:00;0.492;0.208;236.240;2.200;0.000;0.000;0.000
28/4/2007;00:21:00;?;?;?;?;?;?;
21/12/2006;11:25:00;0.246;0.000;241.740;1.000;0.000;0.000;0.000
21/12/2006;11:26:00;0.246;0.000;241.830;1.000;0.000;0.000;0.000

The NA values are represented by "?". I'm trying to read the file with

library(data.table)

epcData <- fread(dataFile,
                 sep = ";",
                 header = TRUE,
                 na.strings = "?",
                 colClasses = c("character", "character", rep("numeric", 7)),
                 stringsAsFactors = FALSE)

I've got warnings like:

Bumped column 3 to type character on data row 10, field contains '?'. Coercing previously read values in this column from integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.

Row 10 is

   28/4/2007;00:21:00;?;?;?;?;?;?;

epcData[10]

prints

         Date     Time Global_active_power Global_reactive_power Voltage
1: 28/4/2007 00:21:00                  NA                    NA      NA
   Global_intensity Sub_metering_1 Sub_metering_2 Sub_metering_3
1:               NA             NA             NA             NA

But the mode of every column is "character", including columns 3:9, even though I passed colClasses = c("character", "character", rep("numeric", 7)).

What is going wrong?

nodm asked Nov 01 '22 17:11


1 Answer

As of today, with version 1.12.2 of the data.table package, this is no longer an issue: the import of the above CSV data works flawlessly, and all the question marks are replaced by NAs.
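To check this on your own installation, here is a minimal sketch. It assumes a recent data.table (one that supports the text argument of fread) and uses a few lines copied from the question in place of the real file; the name sample_lines is only for illustration.

```r
library(data.table)

# A few lines in the same layout as the question's file; "?" marks missing values.
# sample_lines is a stand-in for the real file, used here for illustration only.
sample_lines <- c(
  "Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3",
  "16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000",
  "28/4/2007;00:20:00;0.492;0.208;236.240;2.200;0.000;0.000;0.000",
  "28/4/2007;00:21:00;?;?;?;?;?;?;",
  "21/12/2006;11:25:00;0.246;0.000;241.740;1.000;0.000;0.000;0.000"
)

# Same arguments as in the question, reading from the inline text instead of dataFile
epcData <- fread(
  text = sample_lines,
  sep = ";",
  header = TRUE,
  na.strings = "?",
  colClasses = c("character", "character", rep("numeric", 7))
)

# Inspect the class of every column; columns 3:9 should come back as numeric
col_classes <- sapply(epcData, class)
print(col_classes)

# Look at the row that contained the "?" markers
print(epcData[3])
```

If the columns still come back as character, rerunning with verbose = TRUE (as the warning itself suggests) and checking packageVersion("data.table") should show whether an old version is being used.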

hannes101 answered Nov 08 '22 03:11