How to parse string datetime & timezone with Arabic-Hindu digits in Java 8?

I wanted to parse a date-time string, including its timezone offset, written with Arabic-Hindu digits, so I wrote code like this:

    String dateTime = "٢٠٢١-١١-٠٨T٠٢:٢١:٠٨+٠٢:٠٠";
    char zeroDigit = '٠';
    Locale locale = Locale.forLanguageTag("ar");
    DateTimeFormatter pattern = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ssXXX")
            .withLocale(locale)
            .withDecimalStyle(DecimalStyle.of(locale).withZeroDigit(zeroDigit));
    ZonedDateTime parsedDateTime = ZonedDateTime.parse(dateTime, pattern);
    assert parsedDateTime != null;

But I received the exception:

    java.time.format.DateTimeParseException: Text '٢٠٢١-١١-٠٨T٠٢:٢١:٠٨+٠٢:٠٠' could not be parsed at index 19

I checked a lot of questions on Stack Overflow, but I still don't understand what I did wrong.

It works fine with dateTime = "٢٠٢١-١١-٠٨T٠٢:٢١:٠٨+02:00" when the timezone doesn't use Arabic-Hindu digits.

Daria Pydorenko asked Nov 08 '21 12:11



2 Answers

Your dateTime string is wrong; it rests on a misunderstanding. It evidently tries to conform to the ISO 8601 format and fails, because ISO 8601 uses US-ASCII digits.

The classes of java.time (OffsetDateTime and ZonedDateTime, and from Java 12 also Instant, whose parse method accepts only a trailing Z on Java 8) would parse your string without any formatter if only the digits were correct for ISO 8601. In the vast majority of cases I would take your avenue: try to parse the string as it is. Not in this case. To me it makes more sense to correct the string before parsing.

    // Needs: import java.nio.CharBuffer; import java.time.OffsetDateTime;
    String dateTime = "٢٠٢١-١١-٠٨T٠٢:٢١:٠٨+٠٢:٠٠";
    char[] dateTimeChars = dateTime.toCharArray();
    for (int index = 0; index < dateTimeChars.length; index++) {
        if (Character.isDigit(dateTimeChars[index])) {
            // Map any Unicode decimal digit to its US-ASCII counterpart
            int digitValue = Character.getNumericValue(dateTimeChars[index]);
            dateTimeChars[index] = Character.forDigit(digitValue, 10);
        }
    }

    // CharBuffer implements CharSequence, so no conversion back to String is needed
    OffsetDateTime odt = OffsetDateTime.parse(CharBuffer.wrap(dateTimeChars));

    System.out.println(odt);

Output:

2021-11-08T02:21:08+02:00

Edit: It will be even better, of course, if you can educate the publisher of the string to use US-ASCII digits.

Edit: I know the Wikipedia article I link to below says:

Representations must be written in a combination of Arabic numerals and the specific computer characters (such as "-", ":", "T", "W", "Z") that are assigned specific meanings within the standard; …

This is one conceivable cause of the confusion. The article Arabic numerals that it links to says:

Arabic numerals are the ten digits: 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9.

Edit: How I convert each digit: Character.getNumericValue() converts from a char representing a digit to an int equal to the number that the digit represents, so '٠' to 0, '٢' to 2, etc. It works for all characters that are digits (not only Arabic and ASCII ones). Character.forDigit() performs sort of the opposite conversion, only always to US ASCII, so 0 to '0', 2 to '2', etc.
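
The same conversion can be wrapped in a small reusable method. This is only a sketch: the class name DigitNormalizer and the method name toAsciiDigits are made up for illustration, and the input is written with Unicode escapes (U+0660 to U+0669 are the Arabic-Indic digits) so that the source file stays pure ASCII.

```java
final class DigitNormalizer {

    /** Returns a copy of text with every decimal digit replaced by its US-ASCII form. */
    static String toAsciiDigits(CharSequence text) {
        StringBuilder result = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (Character.isDigit(ch)) {
                int digitValue = Character.getNumericValue(ch);    // e.g. U+0662 -> 2
                result.append(Character.forDigit(digitValue, 10)); // 2 -> '2'
            } else {
                result.append(ch); // '-', ':', 'T' and '+' pass through unchanged
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // The offset from the question, written in Arabic-Indic digits
        System.out.println(toAsciiDigits("+\u0660\u0662:\u0660\u0660")); // prints +02:00
    }
}
```

Like the loop above, this works char by char, so it assumes the digits are in the Basic Multilingual Plane, which holds for Arabic-Indic digits.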

Edit: Thanks to @Holger for drawing my attention to CharBuffer in this context. A CharBuffer implements CharSequence, the type that the parse methods of java.time require, so it saves us from converting the char array back to a String.

Links

  • Wikipedia article: ISO 8601
  • Wikipedia article: Arabic numerals

Ole V.V. answered Oct 10 '22 22:10


The error message states that the problem is at index 19 in the input string.

The character at index 19 (counting from zero) is the + character in your input string. This means the offset (represented by XXX in your pattern) cannot be parsed.

The problem is not the + itself. The problem is that timezone offsets, like +05:00, are never localized.
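
A quick way to observe this is to run the same date and time through the formatter twice, once with an ASCII offset and once with an Arabic-Indic one. The sketch below is an illustration only: the class name OffsetDigitsDemo and the helper errorIndex are invented here, and it uses DecimalStyle.STANDARD with the zero digit U+0660 instead of the question's DecimalStyle.of(locale) so that it does not depend on locale data.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.DecimalStyle;

final class OffsetDigitsDemo {

    // Same pattern as in the question; U+0660 is ARABIC-INDIC DIGIT ZERO.
    static final DateTimeFormatter FORMATTER =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ssXXX")
                    .withDecimalStyle(DecimalStyle.STANDARD.withZeroDigit('\u0660'));

    /** Returns -1 if text parses, otherwise the error index reported by the parser. */
    static int errorIndex(String text) {
        try {
            OffsetDateTime.parse(text, FORMATTER);
            return -1;
        } catch (DateTimeParseException e) {
            return e.getErrorIndex();
        }
    }

    public static void main(String[] args) {
        // 2021-11-08T02:21:08 written in Arabic-Indic digits (19 characters)
        String localPart = "\u0662\u0660\u0662\u0661-\u0661\u0661-\u0660\u0668"
                + "T\u0660\u0662:\u0662\u0661:\u0660\u0668";
        System.out.println(errorIndex(localPart + "+02:00"));                     // ASCII offset
        System.out.println(errorIndex(localPart + "+\u0660\u0662:\u0660\u0660")); // localized offset
    }
}
```

The date and time fields honor the DecimalStyle, so the first call reports no error, while the second fails at the start of the offset.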

The documentation doesn’t talk about this, so I had to go to the source code of DateTimeFormatterBuilder to verify it.

Inside that class is this inner class:

    static final class OffsetIdPrinterParser implements DateTimePrinterParser {

In that class, we can find a parse method which has calls to the private parseHour, parseMinute, and parseSeconds methods.

Each of those methods delegates to a private parseDigits method. In that method, we can see that only ASCII digits are considered:

    char ch1 = parseText.charAt(pos++);
    char ch2 = parseText.charAt(pos++);
    if (ch1 < '0' || ch1 > '9' || ch2 < '0' || ch2 > '9') {
        return false;
    }

So, the answer here is that the timezone offset must consist of ASCII digits, regardless of the locale.
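
If you want to keep the localized formatter for the date and time fields, one possible workaround is to split the input: parse the first 19 characters with the DecimalStyle-aware formatter and convert only the offset digits to ASCII before handing them to ZoneOffset.of. This is a sketch under assumptions, not a general solution: the class name SplitOffsetParser is made up, the layout is assumed to be fixed width (19 characters of date and time followed by an offset such as +HH:MM), and the zero digit is hard-coded to U+0660.

```java
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DecimalStyle;

final class SplitOffsetParser {

    // Localized digits are honored for the date and time fields.
    private static final DateTimeFormatter LOCAL_PART =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss")
                    .withDecimalStyle(DecimalStyle.STANDARD.withZeroDigit('\u0660'));

    static OffsetDateTime parse(String text) {
        // Assumption: the first 19 chars are the date-time, the rest is the offset.
        LocalDateTime local = LocalDateTime.parse(text.substring(0, 19), LOCAL_PART);

        // The offset parser accepts only ASCII digits, so convert them first.
        StringBuilder offset = new StringBuilder();
        for (char ch : text.substring(19).toCharArray()) {
            offset.append(Character.isDigit(ch)
                    ? Character.forDigit(Character.digit(ch, 10), 10)
                    : ch);
        }
        return OffsetDateTime.of(local, ZoneOffset.of(offset.toString()));
    }

    public static void main(String[] args) {
        // The string from the question, written with Unicode escapes
        String input = "\u0662\u0660\u0662\u0661-\u0661\u0661-\u0660\u0668"
                + "T\u0660\u0662:\u0662\u0661:\u0660\u0668+\u0660\u0662:\u0660\u0660";
        System.out.println(parse(input)); // prints 2021-11-08T02:21:08+02:00
    }
}
```

Converting the digits of the whole string up front is simpler; the split only pays off when the date and time part must go through a locale-specific formatter anyway.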

VGR answered Oct 10 '22 22:10