Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

cast string directly to IDateTime

I am using the new version of data.table and especially the AWESOME fread function. My files contain dates that are loaded as strings (cause I don't know to do it otherwise) looking like 01APR2008:09:00:00.

I need to sort the data.table on those datetimes and then for the sort to be efficient to cast then in the IDateTime format (or anything alse I would not know yet).

> strptime("01APR2008:09:00:00","%d%b%Y:%H:%M:%S")
[1] "2008-04-01 09:00:00"

> IDateTime(strptime("01APR2008:09:00:00","%d%b%Y:%H:%M:%S"))
        idate    itime
1: 2008-04-01 09:00:00

> IDateTime("01APR2008:09:00:00","%d%b%Y:%H:%M:%S")
Error in charToDate(x) : 
character string is not in a standard unambiguous format 

It looks like I cannot do DT[ , newType := IDateTime(strptime(oldType, "%d%b%Y:%H:%M:%S"))].

My questions are then:

  1. Is there a way to cast directly to IDateTime from fread, such that I can sort afterward efficiently?
  2. If not, what is the most efficient way to go knowing that I would like to be able to sort DT by this datetime column
like image 426
statquant Avatar asked Dec 27 '12 14:12

statquant


People also ask

How do I convert a string to a timestamp in Python?

Import the datetime library. Use the datetime. datetime class to handle date and time combinations. Use the strptime method to convert a string datetime to a object datetime.

How do I convert a string to a datetime in Salesforce?

To convert a string to a Date/Time value, use DATETIMEVALUE() passing in a string in the format “YYYY-MM-DD HH:MM:SS”. This method returns the Date/Time value in GMT.

What is Strptime in Python?

The strptime() function in Python is used to format and return a string representation of date and time. It takes in the date, time, or both as an input, and parses it according to the directives given to it. It raises ValueError if the string cannot be formatted according to the provided directives.


2 Answers

Unfortunately (for efficiency) strptime produces a POSIXlt type, which is unsupported by data.table and always will be due its size (40 bytes per date!) and structure. Although strftime produces the much better POSIXct, it still does it via POSIXlt. More info here :

http://stackoverflow.com/a/12788992/403310

Looking to base functions such as as.Date, it uses strptime too, creating an integer offset from epoch (oddly) stored as double. The IDate (and friends) class in data.table aims to achieve integer epoch offsets stored as, um, integer. Suitable for fast sorting by base::sort.list(method = "radix") (which is really a counting sort). IDate doesn't really aim to be fast at (usually one off) conversion.

So to convert string dates/times, rightly or wrongly, I tend to roll my own helper function.

If the string date is "2012-12-24" I'd lean towards: as.integer(gsub("-", "", col)) and proceed with YYYYMMDD integer dates. Similarly times can be HHMMDD as an integer. Two columns: date and time separately can be useful if you generally want to roll = TRUE within a day, but not to the previous day. Grouping by month is simple and fast: by = date %/% 100L. Adding and subtracting days is troublesome, but it is anyway because rarely do you want to add calendar days, rather weekdays or business days. So that's a lookup to your business day vector anyway.

In your case the character month would need a conversion to 1:12. There isn't a separator in your dates "01APR2008", so a substring would be one way followed by a match or fmatch on the month name. Are you in control of the file format? If so, numbers are better in an unambiguous format that sorts naturally such as %Y-%m-%d, or %Y%m%d.

I haven't yet got to how best do this in fread, so date/times are left as character currently because I'm not yet sure how to detect the date format or which type to output. It does need to output either integer or double dates though, rather than inefficient character. I suspect that my use of YYYYMMDD integers are seen as unconventional, so I'm a little hesitant to make that the default. They have their place, and there are pros and cons of epoch based dates too. Dates don't have to be always epoch based is all I'm suggesting.

What do you think? Btw, thanks for encouragement on fread; was nice to see.

like image 65
Matt Dowle Avatar answered Oct 24 '22 23:10

Matt Dowle


I d'ont know how your file is structured, but from your comment you want to use the date field as a key. Why not to read it as a time series and format it when in reading?

Here I use zoo to do it.(Here I suppose that the date column is the first one,otherwise see index.colum argument)

ff <- function(x) as.POSIXct(strptime(x,"%d%b%Y:%H:%M:%S"))

h <- read.zoo(text = "03avril2008:09:00:00  125
                      02avril2008:09:30:00  126
                      05avril2008:09:10:00  127
                      04avril2008:09:20:00  128
                      01avril2008:09:00:00  128"
                      ,FUN=ff)

You get your dates sorted in the right format and sorted.

The conversion is natural from POSIXct to IDateTime

    IDateTime(index(h))
        idate    itime
1: 2008-04-01 09:00:00
2: 2008-04-02 09:30:00
3: 2008-04-03 09:00:00
4: 2008-04-04 09:20:00
5: 2008-04-05 09:10:00

Here sure you still do 2 conversions, But you do it when reading data, and the second you do it without dealing with any format problem.

like image 23
agstudy Avatar answered Oct 25 '22 00:10

agstudy