I need python-dateutil 1.5's parse() to work with Unicode month names.
If I use fuzzy=True, it skips the month name and produces a result with month = 1.
When I use it without the fuzzy parameter, I get the following exception:
from dateutil.parser import parserinfo, parser, parse

class myparserinfo(parserinfo):
    MONTHS = parserinfo.MONTHS[:]
    MONTHS[3] = (u"Foo", u"Foo", u"Июнь")
>>> test = unicode('8th of Июнь', 'utf-8')
>>> tester = parse(test, parserinfo=myparserinfo())
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "C:\Python27\lib\site-packages\python_dateutil-1.5-py2.7.egg\dateutil\parser.py", line 695, in parse
    return parser(parserinfo).parse(timestr, **kwargs)
  File "C:\Python27\lib\site-packages\python_dateutil-1.5-py2.7.egg\dateutil\parser.py", line 303, in parse
    raise ValueError, "unknown string format"
ValueError: unknown string format
Rik Poggi is right: the string 'Июнь' cannot be a month for python-dateutil. Digging a little into dateutil/parser.py, the basic problem is that this module is only internationalised enough to handle Western European Latin-script languages. It is not designed to handle languages, such as Russian, that use non-Latin scripts, such as Cyrillic.
The biggest obstacle is in dateutil/parser.py:45-48, where the lexical analyser class _timelex defines the characters which can be used in tokens, including month and day names:
class _timelex(object):
    def __init__(self, instream):
        # ... [some material omitted] ...
        self.wordchars = ('abcdfeghijklmnopqrstuvwxyz'
                          'ABCDEFGHIJKLMNOPQRSTUVWXYZ_'
                          'ßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ'
                          'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ')
        self.numchars = '0123456789'
        self.whitespace = ' \t\r\n'
Because wordchars does not include Cyrillic letters, _timelex emits each byte of the month name as a separate token. This is what Rik observed.
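The effect of a restricted wordchars set can be illustrated with a toy splitter. This is a simplified sketch, not dateutil's _timelex: the function toy_split and the constants below are hypothetical names made up for the illustration, and the sketch works on Unicode characters rather than bytes.

```python
# Toy model of a lexer driven by a fixed set of word characters.
# NOT dateutil's _timelex; toy_split and the constants are hypothetical.
LATIN_WORDCHARS = (
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ_"
)
NUMCHARS = "0123456789"

# Assumption for the demo: basic Russian alphabet, U+0410..U+044F plus Ё/ё.
CYRILLIC_WORDCHARS = "".join(chr(c) for c in range(0x0410, 0x0450)) + "Ёё"


def toy_split(text, wordchars=LATIN_WORDCHARS):
    """Group runs of word characters and runs of digits.

    Any character that is neither a word character nor a digit
    (whitespace included) is emitted as a token of its own.
    """
    tokens = []
    current = ""
    current_kind = None
    for ch in text:
        if ch in wordchars:
            kind = "word"
        elif ch in NUMCHARS:
            kind = "num"
        else:
            kind = None
        if kind is not None and kind == current_kind:
            current += ch          # extend the current run
        else:
            if current:
                tokens.append(current)
            current = ch           # start a new token
            current_kind = kind
    if current:
        tokens.append(current)
    return tokens


latin_only = toy_split("8th of Июнь")
with_cyrillic = toy_split("8th of Июнь", LATIN_WORDCHARS + CYRILLIC_WORDCHARS)
```

With the Latin-only set, the month name falls apart into one token per character; once the Cyrillic letters are added to the word characters, it survives as a single token.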
Another large obstacle is that dateutil uses Python byte strings instead of Unicode strings internally for all of its processing. This means that, even if _timelex were extended to accept Cyrillic letters, there would still be mismatches between the handling of bytes and of characters, and problems caused by differences in string encoding between the caller and the python_dateutil source code.
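As a small illustration of the bytes-versus-characters mismatch, here is a sketch using only the standard codecs. The choice of UTF-8 and cp1251 is an assumption made for the demo; the thread does not say which encodings were actually involved.

```python
# The same month name as characters and as bytes in two encodings.
# UTF-8 and cp1251 are picked only as examples of encodings that a
# caller and a source file might plausibly disagree on.
month = u"Июнь"

as_utf8 = month.encode("utf-8")      # two bytes per Cyrillic letter
as_cp1251 = month.encode("cp1251")   # one byte per Cyrillic letter

char_count = len(month)
utf8_byte_count = len(as_utf8)
cp1251_byte_count = len(as_cp1251)

# A byte-string comparison fails even though the text is identical.
same_bytes = as_utf8 == as_cp1251
```

A month table stored as byte strings in one encoding therefore cannot match input that arrives in another, whereas comparing Unicode strings avoids the problem.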
There are other minor issues, such as an assumption that every month name is at least 3 characters long (not true for Japanese), and many details related to the Gregorian calendar. It would be helpful for the wordchars field to be picked up from parserinfo if present, so that parserinfo could define the right set of characters for its month and day names.
python_dateutil v2.0 has been ported to Python 3, but the above design problems aren't significantly changed. The differences between 2.0 and 1.5 are to handle Python language changes, not dateutil's design and data structures.
Oleg, you were able to modify parserinfo, and I suspect you succeeded because your test code didn't use the parser() (and _timelex) of python_dateutil. You in essence supplied your own parser and lexer.
Correcting this problem would require fairly major improvements to the text-handling of python_dateutil. It would be great if someone were to make a patch with that change, and the package maintainers were able to incorporate it.
I took a look at the source code in dateutil/parser.py, and I've found out that, basically, the string 'Июнь' cannot be a month for dateutil.
The problem starts when your timestr gets split.
At line 349 you have:
l = _timelex.split(timestr)
and since _timelex.split is defined like:
def split(cls, s): # at line 142
    return list(cls(s))
you get l to be:
['8', 'th', ' ', 'of', ' ', '\x18', '\x04', 'N', '\x04', '=', '\x04', 'L', '\x04']
instead of (more or less) what one would expect it to be:
[u'8th', u'of', u'\u0418\u044e\u043d\u044c']
For this reason the month check returns None, which leads to the exception being raised:
# Check month name
value = info.month(l[i])
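One observation about the byte values in the token list above, offered as a sketch rather than as something the thread establishes: they coincide with the UTF-16 little-endian encoding of 'Июнь'. Why the text reached the lexer in that form is not determined here.

```python
# Split the UTF-16-LE encoding of the month name into single bytes and
# compare with the tail of the token list quoted above.
month = u"Июнь"
raw = month.encode("utf-16-le")

# Slicing keeps each element a one-byte bytes object.
byte_tokens = [raw[i:i + 1] for i in range(len(raw))]

observed_tail = [b"\x18", b"\x04", b"N", b"\x04", b"=", b"\x04", b"L", b"\x04"]
matches_observed = byte_tokens == observed_tail
```

None of these single-byte tokens is a month name, which is consistent with info.month() returning None for each of them.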
Translate everything into English and then, if needed, back into Russian.
Example:
dictionary = {u"Июнь": 'June', u'ноябрь': 'November'}
for russian, english in dictionary.items():
    test = test.replace(russian, english)
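The same idea can be wrapped in a small helper. This is a minimal sketch: the names RU_TO_EN_MONTHS and translate_months are hypothetical, the table only holds the two entries from the example above, and matching is case-sensitive.

```python
# -*- coding: utf-8 -*-
# Replace Russian month names with English ones before parsing.
# RU_TO_EN_MONTHS and translate_months are illustrative names only;
# a real table would need all twelve months and their inflected forms.
RU_TO_EN_MONTHS = {
    u"Июнь": u"June",
    u"ноябрь": u"November",
}


def translate_months(text, table=RU_TO_EN_MONTHS):
    """Return text with every known Russian month name replaced."""
    for russian, english in table.items():
        text = text.replace(russian, english)
    return text


translated = translate_months(u"8th of Июнь")
# The translated string contains only Latin-script words and can then
# be handed to dateutil's parse() in the usual way.
```

Because str.replace is case-sensitive, u"июнь" would not be matched by the u"Июнь" entry; normalising the case of both the table keys and the input first is one way around that.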