Why R package lubridate can't parse vector with multiple formats?

Tags: date, r, lubridate

I'm using package lubridate to parse a vector of heterogeneously-formatted dates and convert them to string, like this:

parse_date_time(c('12/17/1996 04:00:00 PM','4/18/1950 0130'), c('%m/%d/%Y %I:%M:%S %p','%m/%d/%Y %H%M'))

This is the result:

[1] NA NA
Warning message:
All formats failed to parse. No formats found.

If I remove the %p in the 1st format string, it incorrectly parses the 1st date string, and still doesn't parse the 2nd, like so:

[1] "1996-12-17 04:00:00 UTC" NA                       
Warning message:
 1 failed to parse. 

The 4PM time in the string is parsed to 4AM in the result.

Has anyone experienced this strange behavior?

Jesus Ramos asked May 20 '15 01:05


2 Answers

This is probably related to your system locale.

  • parse_date_time {lubridate}

    p : AM/PM indicator in the locale. Used in conjunction with I and not with H. An empty string in some locales.

Because different languages have different strings for AM/PM, if your locale is not English, lubridate will not pick up the AM/PM indicator even if you specify it.

The OS locale can cover the display language, time format, and time zone. I'm using English Windows with a US time zone and a Chinese locale, so I have been fighting with AM/PM in time parsing too.

Sys.getlocale("LC_TIME")
[1] "Chinese (Simplified)_China.936"
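
A quick way to see what the current locale uses for the AM/PM indicator is to format a morning and an afternoon time with %p. This is only a diagnostic sketch using base R; the variable names are made up for illustration, and the printed strings depend on your system (they may be empty in some locales).

```r
# Show the active LC_TIME locale.
current_locale <- Sys.getlocale("LC_TIME")
print(current_locale)

# Two fixed times, one before noon and one after noon.
times <- as.POSIXct(c("2000-01-01 04:00:00", "2000-01-01 16:00:00"), tz = "UTC")

# %p is rendered with the locale's own AM/PM strings.
am_pm <- format(times, "%p")
print(am_pm)
```

If the two strings are not "AM" and "PM", input that contains the English indicators is unlikely to match %p under that locale.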

You can specify the locale for parse_date_time {lubridate}, but setting it with Sys.setlocale didn't work for me at first:

Sys.setlocale("LC_TIME", "en_US") 
[1] ""
Warning message:
In Sys.setlocale("LC_TIME", "en_US") :
  OS reports request to set locale to "en_US" cannot be honored

  • locales {base}

    The locale describes aspects of the internationalization of a program. Initially most aspects of the locale of R are set to "C" (which is the default for the C language and reflects North-American usage). See strptime for uses of category = "LC_TIME".

Then I found this and used it successfully:

Sys.setlocale("LC_TIME", "C")
[1] "C"

After this the parsing works:

parse_date_time('12/17/1996 04:00:00 PM', '%m/%d/%Y %I:%M:%S %p')
[1] "1996-12-17 16:00:00 UTC"

You can also specify the time zone and locale directly in the call:

parse_date_time('12/17/1996 04:00:00 PM', '%m/%d/%Y %I:%M:%S %p', tz = "America/New_York", locale = "C")
[1] "1996-12-17 16:00:00 EST"
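
If you go the Sys.setlocale route but don't want to leave LC_TIME changed for the rest of the session, one option is to wrap the call and restore the previous value afterwards. The helper name parse_in_c_locale below is hypothetical, not part of lubridate; this is a minimal sketch that assumes the "C" locale is available on your system.

```r
library(lubridate)

# Hypothetical helper: parse under the "C" LC_TIME locale, then restore
# whatever locale was active before the call.
parse_in_c_locale <- function(x, orders, ...) {
  old <- Sys.getlocale("LC_TIME")
  on.exit(Sys.setlocale("LC_TIME", old), add = TRUE)
  Sys.setlocale("LC_TIME", "C")
  parse_date_time(x, orders, ...)
}

res <- parse_in_c_locale("12/17/1996 04:00:00 PM", "%m/%d/%Y %I:%M:%S %p")
print(res)
```

Passing locale = "C" to parse_date_time, as in the call above, is simpler when it works; the wrapper is mainly useful when several locale-sensitive calls need the same setting.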
dracodoc answered Nov 19 '22 09:11


The problem with the %p part is locale-related. See this issue.

The inability to parse has to do with the way the lubridate guesser works.

There are two ways lubridate infers formats: flex and exact. With flex matching, all numeric elements can have flexible length (for example, both 4 and 04 for the day will work), but there must be non-numeric separators between the elements. For the exact matcher, there need not be non-numeric separators, but elements must have an exact number of digits (like 04).

Unfortunately you cannot combine both matchers within one expression. It would be extremely hard to fix this and preserve the current flexibility of the lubridate parser.

In your example

> parse_date_time('4/18/1950 0130', 'mdY HM')
[1] NA
Warning message:
All formats failed to parse. No formats found. 

you want to perform flex matching on the date part 4/18/1950 and exact matching on time part 0130.

Note that if your date-time is in a fully flex or fully exact format, the parsing works as expected:

> parse_date_time('04/18/1950 0130', 'mdY HM')
[1] "1950-04-18 01:30:00 UTC"
> parse_date_time('4/18/1950 1:30', 'mdY HM')
[1] "1950-04-18 01:30:00 UTC"
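
Following from that, one possible workaround for mixed input is to rewrite the string so that it becomes fully flex before parsing. The sketch below inserts a colon into a trailing four-digit time with base R's sub(); the regular expression and the variable names are my own assumptions about the input shape (a space followed by exactly four digits at the end), not something lubridate provides.

```r
library(lubridate)

x <- "4/18/1950 0130"

# Turn a trailing " HHMM" into " HH:MM" so every element has a separator.
x_flex <- sub(" (\\d{2})(\\d{2})$", " \\1:\\2", x)
print(x_flex)

# The string is now fully flex, so the guesser can handle it.
parsed <- parse_date_time(x_flex, "mdY HM")
print(parsed)
```

Strings that do not end in a space plus four digits, such as '12/17/1996 04:00:00 PM', are left unchanged by this substitution.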

lubridate 1.4.1 "fixes" this by adding a new argument to parse_date_time, exact = FALSE. When set to TRUE, the orders argument is interpreted as containing exact strptime formats, and no guessing or training is performed. This way you can add as many exact formats as you want, and you also gain speed because no guessing is performed at all.

> parse_date_time(c('12/17/1996 04:00:00','4/18/1950 0130'),
+                 c('%m/%d/%Y %I:%M:%S','%m/%d/%Y %H%M'),
+                 exact = TRUE)
[1] "1996-12-17 04:00:00 UTC" "1950-04-18 01:30:00 UTC"
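
To cover the original question's input, including the PM indicator, the exact formats can be combined with an explicit locale. This is a sketch that assumes a lubridate version with the exact argument and a system where the "C" locale is available; the variable names are illustrative.

```r
library(lubridate)

dates <- c("12/17/1996 04:00:00 PM", "4/18/1950 0130")

# Exact strptime formats, tried as given, with no guessing.
formats <- c("%m/%d/%Y %I:%M:%S %p", "%m/%d/%Y %H%M")

# locale = "C" so that %p matches the English AM/PM strings.
res <- parse_date_time(dates, formats, exact = TRUE, locale = "C")
print(res)
```

Each element is matched against the formats as written, so the first string should come out as a 4 PM time rather than 4 AM.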

Relatedly, there was an explicit request asking for such an option.

VitoshKa answered Nov 19 '22 10:11