Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to convert Date or Datetime field when some parts are blank; na.omit fails

I have a data set that has dates and times for in and out. Each line is an in and out set, but some are blank. I can remove the blanks with na.omit and a nice read in (it was a csv, and na.strings=c("") works on the read.csv).

Of course, because the real world is never like the tutorial, some of the times are only dates, so my as.POSIXlt(Dataset$In,format="%m/%d/%Y %H:%M") returns NA on the "only date no time"s.

na.omit does not remove these lines. so the questions are 2

  1. Why doesn't na.omit work, or how can I get it to work?

  2. Better, How can I convert one column into both Dates and Times (in the posix format) without 2 calls or with some sort of optional parameter in the format string? (or is this even possible?).

This is a sample of the dates and times. I can't share the real file, 1 it's huge, 2 it's PII.

Id,In,Out
1,8/15/2015 8:00,8/15/2015 17:00
1,8/16/2015 8:04,8/16/2015
1,8/17/2015 8:50,8/17/2015 18:00
1,8/18/2015,8/18/2015 17:00
2,8/15/2015,8/15/2015 13:00
2,8/16/2015 8:00,8/16/2015 17:00
3,8/15/2015 4:00,8/15/2015 11:00
3,8/16/2015 9:00,8/16/2015 19:00
3,8/17/2015,8/17/2015 17:00
3,,
4,,
4,8/16/2015 6:00,8/16/2015 20:00
like image 366
Rufus Shinra Avatar asked Nov 19 '15 19:11

Rufus Shinra


2 Answers

DF <- read.table(text = "Id,In,Out
                 1,8/15/2015 8:00,8/15/2015 17:00
                 1,8/16/2015 8:04,8/16/2015
                 1,8/17/2015 8:50,8/17/2015 18:00
                 1,8/18/2015,8/18/2015 17:00
                 2,8/15/2015,8/15/2015 13:00
                 2,8/16/2015 8:00,8/16/2015 17:00
                 3,8/15/2015 4:00,8/15/2015 11:00
                 3,8/16/2015 9:00,8/16/2015 19:00
                 3,8/17/2015,8/17/2015 17:00", header = TRUE, sep = ",",
                 stringsAsFactors = FALSE) #set this option during import


DF$In[nchar(DF$In) < 13] <- paste(DF$In[nchar(DF$In) < 13], "0:00")
DF$Out[nchar(DF$Out) < 13] <- paste(DF$Out[nchar(DF$Out) < 13], "0:00")

DF$In <- as.POSIXct(DF$In, format = "%m/%d/%Y %H:%M", tz = "GMT")
DF$Out <- as.POSIXct(DF$Out, format = "%m/%d/%Y %H:%M", tz = "GMT")
#  Id                  In                 Out
#1  1 2015-08-15 08:00:00 2015-08-15 17:00:00
#2  1 2015-08-16 08:04:00 2015-08-16 00:00:00
#3  1 2015-08-17 08:50:00 2015-08-17 18:00:00
#4  1 2015-08-18 00:00:00 2015-08-18 17:00:00
#5  2 2015-08-15 00:00:00 2015-08-15 13:00:00
#6  2 2015-08-16 08:00:00 2015-08-16 17:00:00
#7  3 2015-08-15 04:00:00 2015-08-15 11:00:00
#8  3 2015-08-16 09:00:00 2015-08-16 19:00:00
#9  3 2015-08-17 00:00:00 2015-08-17 17:00:00

na.omit doesn't work with POSIXlt objects because it is documented to "handle vectors, matrices and data frames comprising vectors and matrices (only)." (see help("na.omit")). And in the strict sense, POSIXlt objects are not vectors:

unclass(as.POSIXlt(DF$In))
#$sec
#[1] 0 0 0 0 0 0 0 0 0
#
#$min
#[1]  0  4 50  0  0  0  0  0  0
#
#$hour
#[1] 8 8 8 0 0 8 4 9 0
#
#$mday
#[1] 15 16 17 18 15 16 15 16 17
#
#$mon
#[1] 7 7 7 7 7 7 7 7 7
#
#$year
#[1] 115 115 115 115 115 115 115 115 115
#
#$wday
#[1] 6 0 1 2 6 0 6 0 1
#
#$yday
#[1] 226 227 228 229 226 227 226 227 228
#
#$isdst
#[1] 0 0 0 0 0 0 0 0 0
#
#attr(,"tzone")
#[1] "GMT"

There is hardly any reason to prefer POSIXlt over POSIXct (which is an integer giving the number of seconds since the origin internally and thus needs less memory).

like image 158
Roland Avatar answered Nov 05 '22 06:11

Roland


You've been given a couple of strategies that bring these character values in and process "in-place". I almost never use as.POSIXlt since there are so many pitfalls in dealing with the list-in-list structures that it returns, especially considering its effective incompatibility with dataframes. Here's a method that does the testing and coercion at the read.-level by defining an as-method:

setOldClass("inTime", prototype="POSIXct")
setAs("character", "inTime", 
      function(from) structure( ifelse( is.na(as.POSIXct(from, format="%m/%d/%Y %H:%M") ), 
                                          as.POSIXct(from, format="%m/%d/%Y") ,
                                          as.POSIXct(from, format="%m/%d/%Y %H:%M")  ), 
                                class="POSIXct" ) )

read.csv(text=txt, colClasses=c("numeric", 'inTime','inTime') )
  Id                  In                 Out
1  1 2015-08-15 08:00:00 2015-08-15 17:00:00
2  1 2015-08-16 08:04:00 2015-08-16 00:00:00
3  1 2015-08-17 08:50:00 2015-08-17 18:00:00
4  1 2015-08-18 00:00:00 2015-08-18 17:00:00
5  2 2015-08-15 00:00:00 2015-08-15 13:00:00
6  2 2015-08-16 08:00:00 2015-08-16 17:00:00
7  3 2015-08-15 04:00:00 2015-08-15 11:00:00
8  3 2015-08-16 09:00:00 2015-08-16 19:00:00
9  3 2015-08-17 00:00:00 2015-08-17 17:00:00

The structure "envelope" is needed because of the rather strange behavior of ifelse, which otherwise would return a numeric object rather than an object of class-'POSIXct'.

like image 25
IRTFM Avatar answered Nov 05 '22 07:11

IRTFM