
Parse dates in format dmy together with dmY using parse_date_time

Tags:

date

r

lubridate

I have a character vector of dates whose formats are mostly dmY (e.g. 27-09-2013) and dmy (e.g. 27-09-13), with occasional b or B months. Thus, parse_date_time in package lubridate, which "allows the user to specify several format-orders to handle heterogeneous date-time character representations", could be a very useful function for me.

However, it seems that parse_date_time has a problem parsing dmy dates when they occur together with dmY dates. When parsing dmy alone, or dmy together with some other formats relevant to me, it works fine. This pattern was also noted in a comment to @Peyton's answer here. A quick fix was suggested, but I wish to ask if it is possible to handle it in lubridate.

Here I show some examples where I try to parse dates in dmy format together with some other formats, specifying orders accordingly.

library(lubridate)
# version: lubridate_1.3.0

# regarding how date format is specified in 'orders':
# examples in ?parse_date_time
# parse_date_time(x, "ymd")
# parse_date_time(x, "%y%m%d")
# parse_date_time(x, "%y %m %d")
# these order strings are equivalent and parse the same way
# "Formatting orders might include arbitrary separators. These are discarded"

# dmy date only
parse_date_time(x = "27-09-13", orders = "d m y")
# [1] "2013-09-27 UTC"
# OK

# dmy & dBY
parse_date_time(c("27-09-13", "27 September 2013"), orders = c("d m y", "d B Y"))
# [1] "2013-09-27 UTC" "2013-09-27 UTC"
# OK

# dmy & dbY
parse_date_time(c("27-09-13", "27 Sep 2013"), orders = c("d m y", "d b Y"))
# [1] "2013-09-27 UTC" "2013-09-27 UTC"
# OK

# dmy & dmY
parse_date_time(c("27-09-13", "27-09-2013"), orders = c("d m y", "d m Y"))
# [1] "0013-09-27 UTC" "2013-09-27 UTC"
# not OK

# does order of the date components matter?
parse_date_time(c("2013-09-27", "13-09-13"), orders = c("Y m d", "y m d"))
# [1] "2013-09-27 UTC" "0013-09-13 UTC"
# no

What about the select_formats argument? I am sorry to say this, but I have a hard time understanding this section of the help file. And a search for select_formats on SO: 0 results. Still, this section seemed relevant: "By default the formats with most formating tockens (%) are selected and %Y counts as 2.5 tockens (so that it can have priority over %y%m).". So I (desperately) tried with some additional dmy dates:

parse_date_time(c("27-09-2013", rep("27-09-13", 10)), orders = c("d m y", "d m Y"))
# not OK. Tried also 100 dmy dates.

# does order in the vector matter?
parse_date_time(c(rep("27-09-13", 10), "27-09-2013"), orders = c("d m y", "d m Y"))
# no
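The default selection rule quoted from the help file can be reproduced in base R. This is only a sketch of the scoring, assuming the two candidate formats are "%d-%m-%y" and "%d-%m-%Y"; the variable names are made up for illustration, and the 1.5 bonus is simply what turns one %Y token into the quoted 2.5.

```r
# Candidate formats (assumed for illustration)
fmts <- c("%d-%m-%y", "%d-%m-%Y")

# Count the '%' tokens in each format: 3 and 3
n_tokens <- nchar(gsub("[^%]", "", fmts))

# Add 1.5 for any format containing %Y, so that %Y counts as 2.5 tokens
score <- n_tokens + grepl("%Y", fmts) * 1.5

# The highest-scoring format is selected
best <- fmts[which.max(score)]
print(score)
print(best)
```

Under this rule the %Y format outranks the %y format however many two-digit-year dates the vector contains, which may explain why adding more dmy dates made no difference and why 27-09-13 comes back as year 0013.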

I then checked how the guess_formats function (also in lubridate) handled dmy together with dmY:

guess_formats(c("27-09-13", "27-09-2013"), c("dmy", "dmY"), print_matches = TRUE)
#                   dmy        dmY       
# [1,] "27-09-13"   "%d-%m-%y" ""        
# [2,] "27-09-2013" "%d-%m-%Y" "%d-%m-%Y"
# OK   

From ?guess_formats: y also matches Y. From ?parse_date_time: y* Year without century (00–99 or 0–99). Also matches year with century (Y format). So I tried:

guess_formats(c("27-09-13", "27-09-2013"), c("dmy"), print_matches = TRUE)
#                   dmy       
# [1,] "27-09-13"   "%d-%m-%y"
# [2,] "27-09-2013" "%d-%m-%Y"
# OK

Thus, guess_formats seems to be able to deal with dmy together with dmY. But how can I tell parse_date_time to do the same? Thanks in advance for any comments or help.

Update: I posted the question as a lubridate bug report, and got a rapid reply from @vitoshka: "This is a bug".

Asked by Henrik, Oct 01 '13 22:10


1 Answer

It looks like a bug, but I am not sure, so you should contact the maintainer.

Building the package from source and changing one line in this internal function (I replaced which.max with which.min):

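# Patched internal function: score each trained format by its number of
# '%' tokens (plus 1.5 if it contains %Y) and return the lowest-scoring one.
.select_formats <- function(trained) {
  n_fmts <- nchar(gsub("[^%]", "", names(trained))) +
    grepl("%Y", names(trained)) * 1.5
  names(trained[which.min(n_fmts)])  # the original code used which.max here
}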

seems to correct the problem. Frankly, I don't know why this works, but I guess it is some kind of ranking.

parse_date_time(c("27-09-13", "27-09-2013"), orders = c("d m y", "d m Y"))
# [1] "2013-09-27 UTC" "2013-09-27 UTC"

parse_date_time(c("2013-09-27", "13-09-13"), orders = c("Y m d", "y m d"))
# [1] "2013-09-27 UTC" "2013-09-13 UTC"
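If rebuilding the package is not an option, the two year widths can also be separated before parsing. Below is a minimal base-R sketch that does not use lubridate at all. The helper name parse_dmy_mixed is hypothetical, and it assumes day-month-year strings with a separator before the year, so a date without separators such as 270913 would be misclassified.

```r
# Hypothetical helper (not part of lubridate): parse day-month-year strings
# whose year has either two or four digits, using base R only.
parse_dmy_mixed <- function(x) {
  # TRUE where the string ends in four consecutive digits, e.g. "27-09-2013"
  four_digit <- grepl("[0-9]{4}$", x)

  # Start from all-NA dates and fill each group with its own format
  out <- rep(as.Date(NA), length(x))
  out[four_digit]  <- as.Date(x[four_digit],  format = "%d-%m-%Y")
  out[!four_digit] <- as.Date(x[!four_digit], format = "%d-%m-%y")
  out
}

res <- parse_dmy_mixed(c("27-09-13", "27-09-2013"))
print(res)
```

Note that this returns Date objects rather than the POSIXct values that parse_date_time returns, and that base R's %y maps two-digit years 00 to 68 onto 2000 to 2068 and 69 to 99 onto 1969 to 1999.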
Answered by agstudy, Dec 15 '22 23:12