Elastic Search and Y10k (years with more than 4 digits)

I discovered this issue in connection with Elastic Search queries, but since the ES date format documentation links to the API documentation for the java.time.format.DateTimeFormatter class, the problem is not really ES specific.

Short summary: We are having problems with dates beyond year 9999 or, more precisely, with years that have more than 4 digits.

The documents stored in ES have a date field, which is defined in the index descriptor with the format "date", corresponding to "yyyy-MM-dd" in the pattern language of DateTimeFormatter. We take user input, validate it using org.apache.commons.validator.DateValidator.isValid, also with the pattern "yyyy-MM-dd", and if it is valid, we create an ES query from it. This fails with an exception if the user enters something like 20202-12-03. Such a search term is probably not intentional, but the expected behaviour would be to find nothing, not for the software to cough up an exception.

The problem is that org.apache.commons.validator.DateValidator is internally using the older SimpleDateFormat class to verify if the input conforms to the pattern and the meaning of "yyyy" as interpreted by SimpleDateFormat is something like: Use at least 4 digits, but allow more digits if required. Creating a SimpleDateFormat with pattern "yyyy-MM-dd" will thus both parse an input like "20202-07-14" and similarly format a Date object with a year beyond 9999.
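
To illustrate the lenient year handling described above, here is a minimal sketch that uses SimpleDateFormat directly rather than going through DateValidator. The class and method names are made up for the example, and Locale.ROOT is an assumption chosen only to keep the result independent of the machine's default locale.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

// Hypothetical demo class, not part of commons-validator.
class LenientYearDemo {

    // Parses the input with the old API and returns the year that was read.
    static int parseYear(String input) {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd", Locale.ROOT);
        try {
            Date parsed = format.parse(input);
            Calendar calendar = new GregorianCalendar();
            calendar.setTime(parsed);
            return calendar.get(Calendar.YEAR);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not parseable: " + input, e);
        }
    }

    // Parses the input and formats the result again with the same pattern.
    static String roundTrip(String input) {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd", Locale.ROOT);
        try {
            return format.format(format.parse(input));
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not parseable: " + input, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseYear("20202-07-14")); // 20202
        System.out.println(roundTrip("20202-07-14")); // 20202-07-14
    }
}
```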

The new DateTimeFormatter class is much stricter and interprets "yyyy" as exactly four digits. It will fail to parse an input string like "20202-07-14" and also fail to format a Temporal object with a year beyond 9999. It is worth noting that DateTimeFormatter itself is capable of handling variable-length fields. The constant DateTimeFormatter.ISO_LOCAL_DATE, for example, is not equivalent to "yyyy-MM-dd": in conformance with ISO 8601 it allows years with more than four digits, while always using at least four. This constant is created programmatically with a DateTimeFormatterBuilder and not from a pattern string.
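
A small sketch of the parsing side of this, for anyone who wants to reproduce it. The class and method names are invented for the example; it only checks whether the pattern-based formatter accepts an unsigned five-digit year and prints what ISO_LOCAL_DATE produces for such a date.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical demo class for the pattern-based formatter.
class StrictYearDemo {

    static final DateTimeFormatter PATTERN = DateTimeFormatter.ofPattern("yyyy-MM-dd");

    // Returns true if the pattern-based formatter accepts the input.
    static boolean parses(String input) {
        try {
            LocalDate.parse(input, PATTERN);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    // Formats a date in the given year with the predefined ISO constant.
    static String isoFormat(int year) {
        return LocalDate.of(year, 7, 14).format(DateTimeFormatter.ISO_LOCAL_DATE);
    }

    public static void main(String[] args) {
        System.out.println(parses("2020-07-14"));  // true
        System.out.println(parses("20202-07-14")); // false
        System.out.println(isoFormat(20202));
    }
}
```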

ES can't be configured to use the constants defined in DateTimeFormatter, such as ISO_LOCAL_DATE, only with a pattern string. ES also knows a list of predefined patterns, and the documentation occasionally refers to the ISO standard, but it seems to be mistaken in ignoring that a valid ISO date string can contain a five-digit year.

I can configure ES with a list of multiple allowed date patterns, e.g. "yyyy-MM-dd||yyyyy-MM-dd". That will allow both four and five digits in the year, but fail for a six-digit year. I can support six-digit years by adding yet another allowed pattern: "yyyy-MM-dd||yyyyy-MM-dd||yyyyyy-MM-dd", but then it fails for seven-digit years, and so on.
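
The effect of such a pattern list can be approximated in plain Java. This is only a sketch of the idea, not how ES implements it: parseAny is a hypothetical helper that splits the list on "||" and tries each pattern with DateTimeFormatter until one accepts the input.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical emulation of an "a||b||c" pattern list; ES itself may work differently.
class MultiPatternDemo {

    // Tries each "||"-separated pattern in order and returns the first successful parse.
    static Optional<LocalDate> parseAny(String patterns, String input) {
        for (String pattern : patterns.split("\\|\\|")) {
            try {
                return Optional.of(LocalDate.parse(input, DateTimeFormatter.ofPattern(pattern)));
            } catch (DateTimeParseException e) {
                // This pattern did not match; fall through to the next one.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String patterns = "yyyy-MM-dd||yyyyy-MM-dd";
        System.out.println(parseAny(patterns, "2020-12-03").isPresent());   // true
        System.out.println(parseAny(patterns, "20202-12-03").isPresent());  // true
        System.out.println(parseAny(patterns, "202020-12-03").isPresent()); // false
    }
}
```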

Am I overlooking something, or is it really not possible to configure ES (or a DateTimeFormatter instance using a pattern string) to have a year field with at least four digits (but potentially more), as used by the ISO standard?

jarnbjo asked Jun 23 '20 18:06


2 Answers

Edit

ISO 8601

Since your requirement is to conform with ISO 8601, let’s first see what ISO 8601 says (quoted from the link at the bottom):

To represent years before 0000 or after 9999, the standard also permits the expansion of the year representation but only by prior agreement between the sender and the receiver. An expanded year representation [±YYYYY] must have an agreed-upon number of extra year digits beyond the four-digit minimum, and it must be prefixed with a + or − sign instead of the more common AD/BC (or CE/BCE) notation; …

So 20202-12-03 is not a valid date in ISO 8601. If you explicitly inform your users that you accept, say, up to 6 digit years, then +20202-12-03 and -20202-12-03 are valid, and only with the + or - sign.

Accepting more than 4 digits

The format pattern uuuu-MM-dd formats and parses dates in accordance with ISO 8601, including signed years with more than four digits. For example:

    DateTimeFormatter dateFormatter = DateTimeFormatter.ofPattern("uuuu-MM-dd");
    LocalDate date = LocalDate.parse("+20202-12-03", dateFormatter);
    System.out.println("Parsed: " + date);
    System.out.println("Formatted back: " + date.format(dateFormatter));

Output:

Parsed: +20202-12-03
Formatted back: +20202-12-03

It works quite similarly for a prefixed minus instead of the plus sign.

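Accepting more than 4 digits without sign

If you need to accept longer years without a sign, the only option with pattern strings is the one you have already found, listing one pattern per year width:

    yyyy-MM-dd||yyyyy-MM-dd||yyyyyy-MM-dd||yyyyyyy-MM-dd||yyyyyyyy-MM-dd||yyyyyyyyy-MM-dd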

As I said, this disagrees with ISO 8601. I also agree with you that it isn’t nice. And obviously it will fail for 10 or more digits, but that would fail for a different reason anyway: java.time handles years in the interval -999 999 999 through +999 999 999. So trying yyyyyyyyyy-MM-dd (10 digit year) would get you into serious trouble except in the corner case where the user enters a year with a leading zero.

I am sorry, this is as good as it gets. DateTimeFormatter format patterns do not support all of what you are asking for. There is no (single) pattern that will give you four digit years in the range 0000 through 9999 and more digits for years after that.

The documentation of DateTimeFormatter says about formatting and parsing years:

Year: The count of letters determines the minimum field width below which padding is used. If the count of letters is two, then a reduced two digit form is used. For printing, this outputs the rightmost two digits. For parsing, this will parse using the base value of 2000, resulting in a year within the range 2000 to 2099 inclusive. If the count of letters is less than four (but not two), then the sign is only output for negative years as per SignStyle.NORMAL. Otherwise, the sign is output if the pad width is exceeded, as per SignStyle.EXCEEDS_PAD.

So no matter which count of four or more pattern letters you go for, you will be unable to parse years with more digits without a sign, and years with fewer digits will be formatted with this many digits, padded with leading zeroes.
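
A quick sketch of what the quoted rule means in practice for uuuu. The class and method names are made up for the demonstration.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical demo of padding and sign handling for a four-letter year.
class SignStyleDemo {

    static final DateTimeFormatter FORMATTER = DateTimeFormatter.ofPattern("uuuu-MM-dd");

    // Formats July 14 of the given year.
    static String format(int year) {
        return LocalDate.of(year, 7, 14).format(FORMATTER);
    }

    // Returns true if the formatter accepts the input.
    static boolean parses(String input) {
        try {
            LocalDate.parse(input, FORMATTER);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(format(12));             // 0012-07-14 (padded to four digits)
        System.out.println(format(20202));          // +20202-07-14 (sign once the pad width is exceeded)
        System.out.println(parses("20202-07-14"));  // false (sign missing)
        System.out.println(parses("+20202-07-14")); // true
    }
}
```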

Original answer

You can probably get away with the pattern u-MM-dd. Demonstration:

    String formatPattern = "u-MM-dd";
    
    DateTimeFormatter dateFormatter = DateTimeFormatter.ofPattern(formatPattern);
    
    LocalDate normalDate = LocalDate.parse("2020-07-14", dateFormatter);
    String formattedAgain = normalDate.format(dateFormatter);
    System.out.format("LocalDate: %s. String: %s.%n", normalDate, formattedAgain);
    
    LocalDate largeDate = LocalDate.parse("20202-07-14", dateFormatter);
    String largeFormattedAgain = largeDate.format(dateFormatter);
    System.out.format("LocalDate: %s. String: %s.%n", largeDate, largeFormattedAgain);

Output:

LocalDate: 2020-07-14. String: 2020-07-14.
LocalDate: +20202-07-14. String: 20202-07-14.

Counterintuitively but very practically, one format letter does not mean 1 digit but rather as many digits as it takes. The flip side is that years before year 1000 will be formatted with fewer than 4 digits, which, as you say, disagrees with ISO 8601.

For the difference between pattern letter y and u for year see the link at the bottom.

You might also consider one M and/or one d to accept 2020-007-014, but again, this will cause formatting into just 1 digit for numbers less than 10, like 2020-7-14, which probably isn’t what you want and again disagrees with ISO.

Links

  • Years section of Wikipedia article: ISO 8601
  • Documentation of DateTimeFormatter
  • uuuu versus yyyy in DateTimeFormatter formatting pattern codes in Java?

Ole V.V. answered Nov 06 '22 04:11


Maybe this will work:

[uuuu][uuuuu][...]-MM-dd

Format specifiers placed between square brackets are optional parts. Format specifiers inside brackets can be repeated to allow for multiple options to be accepted.

This pattern will allow a year number of either four or five digits, but rejects all other cases.

Here is this pattern in action. Note that this pattern is useful for parsing a string into a LocalDate. However, to format a LocalDate instance into a string, the pattern should be uuuu-MM-dd. That is because the two optional year parts cause the year number to be printed twice.

Repeating all possible year number digit counts is the closest you can get to making it work the way you expect.

The problem with the current implementation of DateTimeFormatter is that when you specify 4 or more u's or y's, the parser will try to consume exactly that number of year digits (unless a sign is present). With fewer than 4 letters (other than 2, which selects the reduced two-digit form), the parser will try to consume as many digits as possible. I do not know whether this behavior is intentional.

So the intended behavior can be achieved with a formatter builder, but not with a pattern string. As JodaStephen once pointed out, "patterns are a subset of the possible formatters".
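
For the Java side, a builder-based formatter along these lines might look as follows. This is a sketch under assumptions: the class name is invented, the maximum width of 9 is chosen because java.time years have at most nine digits, and SignStyle.NORMAL is picked so that no plus sign is needed or printed. It does not help with ES, which only accepts pattern strings.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.SignStyle;
import java.time.temporal.ChronoField;

// Hypothetical formatter: at least four year digits, more allowed without a sign.
class UnsignedYearFormatter {

    static final DateTimeFormatter FORMATTER = new DateTimeFormatterBuilder()
            // Minimum width 4, maximum width 9, a sign only for negative years.
            .appendValue(ChronoField.YEAR, 4, 9, SignStyle.NORMAL)
            .appendLiteral('-')
            .appendValue(ChronoField.MONTH_OF_YEAR, 2)
            .appendLiteral('-')
            .appendValue(ChronoField.DAY_OF_MONTH, 2)
            .toFormatter();

    static LocalDate parse(String input) {
        return LocalDate.parse(input, FORMATTER);
    }

    static String format(LocalDate date) {
        return date.format(FORMATTER);
    }

    public static void main(String[] args) {
        System.out.println(format(parse("2020-07-14")));       // 2020-07-14
        System.out.println(format(parse("20202-07-14")));      // 20202-07-14
        System.out.println(format(LocalDate.of(12, 7, 14)));   // 0012-07-14
    }
}
```

With this formatter an input like 202-07-14 is rejected, because the year needs at least four digits.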


Maybe the characters #, { and }, which are reserved for future use, will be useful in this regard.

MC Emperor answered Nov 06 '22 02:11