
How to use a Russian date string with strptime

I'm parsing HTML with Python, and the page contains a date string: [ 24-Янв-17 07:24 ]. "Янв" is "Jan". I want to convert it into a datetime object.

# Some beautifulsoup parsing
timeData = data.find('div', {'id' : 'time'}).text

import datetime
import locale
locale.setlocale(locale.LC_TIME, 'ru_RU.UTF-8')
result = datetime.datetime.strptime(timeData, u'[ %d-%b-%y  %H:%M ]')

The error is:

ValueError: time data '[ 24-\xd0\xaf\xd0\xbd\xd0\xb2-17 07:24 ]' does not match format '[ %d-%b-%y %H:%M ]'

type(timeData) returns unicode. Encoding timeData from utf-8 raises UnicodeEncodeError. What's wrong?


chardet returns {'confidence': 0.87625, 'encoding': 'utf-8'}, and when I write datetime.datetime.strptime(timeData.encode('utf-8'), ...) it raises the same error as above.


The original page uses windows-1251 encoding.

print type(timeData)
print timeData


timeData = timeData.encode('cp1251')
print type(timeData)
print timeData

returns

<type 'unicode'>
[ 24-Янв-17 07:24 ]
<type 'str'>
[ 24-???-17 07:24 ]
asked Jan 24 '17 21:01 by Max Frai


1 Answer

Quick fix

Got it! The month name янв has to be lower-case in CPython 2.7.12. This code works in CPython 2.7.12 and CPython 3.4.5 on Cygwin:

# coding=utf8
#timeData='[ 24-Янв-17 07:24 ]'
timeData='[ 24-янв-17 07:24 ]'    ### lower-case
import datetime
import locale
locale.setlocale(locale.LC_TIME, 'ru_RU.UTF-8')
result = datetime.datetime.strptime(timeData, u'[ %d-%b-%y  %H:%M ]')
print(result)

result:

2017-01-24 07:24:00

If I use the upper-case Янв, it works in Py 3, but in Py 2 it gives

ValueError: time data '[ 24-\xd0\xaf\xd0\xbd\xd0\xb2-17 07:24 ]' does not match format '[ %d-%b-%y  %H:%M ]'

General case

To handle this in general in Python 2, lower-case first (see this answer):

# coding=utf8
timeData=u'[ 24-Янв-17 07:24 ]'
       # ^ unicode data
import datetime
import locale
locale.setlocale(locale.LC_TIME, 'ru_RU.UTF-8')
print(timeData.lower())     # works OK
result = datetime.datetime.strptime(
    timeData.lower().encode('utf8'), u'[ %d-%b-%y  %H:%M ]')
    ##               ^^^^^^^^^^^^^^ back to a string
    ##       ^^^^^^^ lowercase
print(result)

Result:

[ 24-янв-17 07:24 ]
2017-01-24 07:24:00

I can't test it with your beautifulsoup code, but, in general, get Unicode data and then use the above.

Or, if at all possible, switch to Python 3 :) .
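
If the ru_RU.UTF-8 locale is not installed, or you would rather not change process-wide locale settings, a locale-free fallback is possible. The following is only a sketch for Python 3, not something I have run against your page: parse_ru_timestamp and RU_MONTHS are hypothetical names, the abbreviations are copied from the a_month list shown under "Explanation", and the two-digit year is assumed to mean 20xx.

```python
import datetime
import re

# Abbreviations copied from the LocaleTime().a_month output shown in this answer.
RU_MONTHS = ['янв', 'фев', 'мар', 'апр', 'май', 'июн',
             'июл', 'авг', 'сен', 'окт', 'ноя', 'дек']

# Hypothetical pattern mirroring the format '[ %d-%b-%y %H:%M ]'.
_PATTERN = re.compile(
    r'\[\s*(\d{1,2})-(\w+)-(\d{2})\s+(\d{1,2}):(\d{2})\s*\]')


def parse_ru_timestamp(text):
    """Parse '[ 24-Янв-17 07:24 ]' without touching the locale."""
    match = _PATTERN.search(text)
    if match is None:
        raise ValueError('unrecognised timestamp: %r' % (text,))
    day, month_name, year, hour, minute = match.groups()
    # Lower-case the text before the lookup, so 'Янв' and 'янв' both work.
    # list.index raises ValueError for an unknown abbreviation.
    month = RU_MONTHS.index(month_name.lower()) + 1
    # Assumption: two-digit years belong to the 2000s.
    return datetime.datetime(
        2000 + int(year), month, int(day), int(hour), int(minute))


print(parse_ru_timestamp('[ 24-Янв-17 07:24 ]'))
```

This trades generality for predictability: it only understands this one layout, but it does not depend on which locales the machine has installed.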

Explanation

So how did I figure this out? I went looking in the CPython source for the strptime implementation and found the handy _strptime module, which contains the class LocaleTime. To print the available month names, add this to the end of the code under "Quick fix" above:

from _strptime import LocaleTime
lt = LocaleTime()
print(lt.a_month)    

a_month has the abbreviated month names per the source.

On Py3, that yields:

['', 'янв', 'фев', 'мар', 'апр', 'май', 'июн', 'июл', 'авг', 'сен', 'окт', 'ноя', 'дек']
      ^ lowercase!

On Py2, that yields:

['', '\xd1\x8f\xd0\xbd\xd0\xb2',

and a bunch more. Note that the first character there is \xd1\x8f, the UTF-8 encoding of lower-case я, whereas your error message contains \xd0\xaf, the UTF-8 encoding of upper-case Я, so the two don't match.
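
To see the mismatch in isolation, here is a small Python 3 sketch that compares the UTF-8 bytes of the two letters. It needs no locale. The last check illustrates what I assume is the underlying cause in Python 2: lower-casing a byte string only changes ASCII letters, so the text has to be lower-cased while it is still unicode.

```python
upper = 'Я'   # U+042F, as it appears on the page
lower = 'я'   # U+044F, as it appears in a_month

upper_bytes = upper.encode('utf-8')
lower_bytes = lower.encode('utf-8')
print(upper_bytes)   # b'\xd0\xaf'
print(lower_bytes)   # b'\xd1\x8f'

# Lower-casing the text, then encoding, gives the bytes found in a_month.
print(upper.lower().encode('utf-8') == lower_bytes)   # True

# Lower-casing the already-encoded bytes leaves them unchanged,
# because bytes.lower() only affects ASCII letters.
print(upper_bytes.lower() == upper_bytes)             # True
```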

answered Oct 02 '22 20:10 by cxw