
How can I make the python_dateutil 1.5 'parse' function work with Unicode month names?

I need python_dateutil 1.5's parse() to work with Unicode month names.

If I pass fuzzy=True, it skips the month name and produces a result with month = 1.

Without the fuzzy parameter, I get the following exception:

from dateutil.parser import parserinfo, parser, parse

class myparserinfo(parserinfo):
    MONTHS = parserinfo.MONTHS[:]
    MONTHS[3] = (u"Foo", u"Foo", u"Июнь")


>>> test = unicode('8th of Июнь', 'utf-8')
>>> tester = parse(test, parserinfo=myparserinfo())
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "C:\Python27\lib\site-packages\python_dateutil-1.5-py2.7.egg\dateutil\parser.py", line 695, in parse
    return parser(parserinfo).parse(timestr, **kwargs)
  File "C:\Python27\lib\site-packages\python_dateutil-1.5-py2.7.egg\dateutil\parser.py", line 303, in parse
    raise ValueError, "unknown string format"
ValueError: unknown string format
Oleg Dats asked Jan 17 '12 14:01



2 Answers

Rik Poggi is right: the string 'Июнь' cannot be a month for python-dateutil. Digging a little into dateutil/parser.py, the basic problem is that this module is internationalised only enough to handle Western European languages written in Latin script. It is not designed to handle languages such as Russian, which use non-Latin scripts such as Cyrillic.

The biggest obstacle is in dateutil/parser.py:45-48, where the lexical analyser class _timelex defines the characters which can be used in tokens, including month and day names:

class _timelex(object):
    def __init__(self, instream):
        # ... [some material omitted] ...
        self.wordchars = ('abcdfeghijklmnopqrstuvwxyz'
                          'ABCDEFGHIJKLMNOPQRSTUVWXYZ_'
                          'ßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ'
                          'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ')
        self.numchars = '0123456789'
        self.whitespace = ' \t\r\n'

Because wordchars does not include Cyrillic letters, _timelex emits each byte of the month name in the date string as a separate token. This is what Rik observed.
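To see why a fixed character set matters, here is a minimal sketch of a lexer that groups characters in the way described above. It is a simplified model written for Python 3, not dateutil's actual _timelex code; the names simple_lex, WORDCHARS, NUMCHARS and WHITESPACE are made up for illustration. Because Python 3 strings are Unicode, the unknown letters come out one character at a time rather than one byte at a time.

```python
# Simplified stand-ins for the character classes a date lexer might use.
WORDCHARS = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_")
NUMCHARS = set("0123456789")
WHITESPACE = set(" \t\r\n")


def simple_lex(text, wordchars=WORDCHARS):
    """Split text into runs of word characters, runs of digits,
    single spaces, and single unknown characters."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in wordchars or ch in NUMCHARS:
            # Consume the whole run belonging to the same character class.
            group = wordchars if ch in wordchars else NUMCHARS
            j = i
            while j < len(text) and text[j] in group:
                j += 1
            tokens.append(text[i:j])
            i = j
        elif ch in WHITESPACE:
            tokens.append(" ")
            i += 1
        else:
            # Anything outside the known classes becomes its own token.
            tokens.append(ch)
            i += 1
    return tokens


latin_only = simple_lex("8th of Июнь")
with_cyrillic = simple_lex("8th of Июнь", WORDCHARS | set("Июнь"))

print(latin_only)     # the month name is broken into single letters
print(with_cyrillic)  # the month name survives as one token
```

Adding the Cyrillic letters to the word character set is enough for this toy lexer to keep the month name in one piece, which is the kind of change the real lexer would need.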

Another large obstacle is that dateutil uses Python byte strings instead of Unicode strings internally for all of its processing. This means that, even if _timelex were extended to accept Cyrillic letters, there would still be mismatches between the handling of bytes and of characters, and problems caused by differences in string encoding between the caller and the python_dateutil source code.

There are other minor issues, such as an assumption that every month name is at least 3 characters long (not true for Japanese), and many details related to the Gregorian calendar. It would be helpful for the wordchars field to be picked up from parserinfo if present, so that parserinfo could define the right set of characters for its month and day names.
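The minimum-length assumption can be illustrated with a small Python 3 model. This is only a sketch of the kind of check described above, not a copy of dateutil's code; month_number, MONTH_LOOKUP and the min_length parameter are hypothetical names.

```python
# Hypothetical lookup table: lower-cased month name -> month number.
MONTH_LOOKUP = {"jun": 6, "june": 6, "6月": 6}


def month_number(name, lookup=MONTH_LOOKUP, min_length=3):
    """Return the month number for name, or None if it is unknown
    or shorter than min_length characters."""
    if len(name) < min_length:
        return None
    return lookup.get(name.lower())


english = month_number("June")
japanese_strict = month_number("6月")                  # rejected: only 2 characters
japanese_relaxed = month_number("6月", min_length=1)   # accepted once the limit is relaxed

print(english, japanese_strict, japanese_relaxed)
```

With the three-character limit in place, a perfectly valid two-character month name is rejected before the lookup table is even consulted.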

python_dateutil v 2.0 has been ported to Python 3, but the above design problems aren't significantly changed. The differences between 2.0 and 1.5 are there to handle Python language changes, not to change dateutil's design and data structures.

Oleg, you were able to modify parserinfo, and I suspect you succeeded because your test code didn't use the parser() (and _timelex) of python_dateutil. You in essence supplied your own parser and lexer.

Correcting this problem would require fairly major improvements to the text-handling of python_dateutil. It would be great if someone were to make a patch with that change, and the package maintainers were able to incorporate it.

Jim DeLaHunt answered Oct 16 '22 12:10



I took a look at the source code in dateutil/parser.py, and I've found out basically that the string 'Июнь' cannot be a month for dateutil.

The problem starts when your timestr gets split.

At line 349 you have:

l = _timelex.split(timestr)

and since _timelex.split is defined like:

def split(cls, s):      # at line 142
    return list(cls(s))

you get l to be:

['8', 'th', ' ', 'of', ' ', '\x18', '\x04', 'N', '\x04', '=', '\x04', 'L', '\x04']

instead of (more or less) what one would expect it to be:

[u'8th', u'of', u'\u0418\u044e\u043d\u044c']

For this reason the month check returns None, which leads to the exception being raised.

# Check month name
value = info.month(l[i])
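As a side note, the odd tokens in the list above line up with the UTF-16 little-endian bytes of the month name, which fits the idea that the lexer is walking over raw bytes rather than characters. A short Python 3 check, with purely illustrative variable names:

```python
# "Июнь" written with escapes so the example does not depend on file encoding.
month = "\u0418\u044e\u043d\u044c"

# Encode as UTF-16 little-endian and view every byte as a one-character token.
raw = month.encode("utf-16-le")
byte_tokens = [chr(b) for b in raw]

print(byte_tokens)
# ['\x18', '\x04', 'N', '\x04', '=', '\x04', 'L', '\x04']
```

These eight tokens are the same ones that follow 'of' and the space in the split result, so none of them can ever match a month name.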

Possible workaround:

Translate everything into English and then, if needed, back into Russian.

Example:

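# Map each Russian month name to its English equivalent (extend as needed).
dictionary = {u"Июнь": 'June', u'ноябрь': 'November'}

# Replace every Russian name found in the input string before parsing it.
for russian, english in dictionary.items():
    test = test.replace(russian, english)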
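A slightly more complete version of this workaround might look like the sketch below. It is written for Python 3 and uses only the standard library; translate_months and RU_TO_EN_MONTHS are hypothetical names, and the table covers only the nominative forms, so inflected forms (genitive, abbreviations) would have to be added if your input uses them.

```python
import re

# Hypothetical mapping of lower-cased Russian month names to English ones.
RU_TO_EN_MONTHS = {
    "январь": "January", "февраль": "February", "март": "March",
    "апрель": "April", "май": "May", "июнь": "June",
    "июль": "July", "август": "August", "сентябрь": "September",
    "октябрь": "October", "ноябрь": "November", "декабрь": "December",
}


def translate_months(text, mapping=RU_TO_EN_MONTHS):
    """Replace every month name from mapping found in text,
    ignoring letter case, with its English equivalent."""
    # Try longer names first so a short key never pre-empts a longer one.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(n) for n in names), re.IGNORECASE)
    return pattern.sub(lambda m: mapping[m.group(0).lower()], text)


translated = translate_months("8th of Июнь")
print(translated)  # 8th of June
```

The translated string contains only Latin letters, so it can then be handed to parse() in the usual way.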
Rik Poggi answered Oct 16 '22 12:10
