
Pandas parse non-English string dates

Pandas is pretty good at parsing string dates when they are in English:

In [1]: pd.to_datetime("11 January 2014 at 10:50AM")
Out[1]: Timestamp('2014-01-11 10:50:00')

I'm wondering if there's an easy way to do the same with pandas when the strings are in another language, for example French:

In [2]: pd.to_datetime("11 Janvier 2016 à 10:50")

ValueError: Unknown string format

Ideally, there would be a way to do it directly in pd.read_csv.

Julien Marrec asked Dec 09 '16 00:12


1 Answer

There is a module named dateparser that can handle numerous languages, including French, Russian, Spanish, Dutch and over 20 more. It can also recognize things like time zone abbreviations.

Let's confirm it works for a single date:

In [1]: import dateparser
        dateparser.parse('11 Janvier 2016 à 10:50')
Out[1]: datetime.datetime(2016, 1, 11, 10, 50)
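
If adding a dependency is not an option and the input is known to be French in one fixed layout, a hand-rolled month lookup is a possible fallback. This is only a sketch using the standard library, not how dateparser works internally; FRENCH_MONTHS and parse_french_date are hypothetical names, and it only accepts the two layouts seen in this question ("11 Janvier 2016 à 10:50" and "7 janvier 1983"):

```python
import re
import unicodedata
from datetime import datetime

# Hypothetical lookup table: lowercase French month name -> month number.
FRENCH_MONTHS = {
    "janvier": 1, "février": 2, "mars": 3, "avril": 4,
    "mai": 5, "juin": 6, "juillet": 7, "août": 8,
    "septembre": 9, "octobre": 10, "novembre": 11, "décembre": 12,
}

# "<day> <month name> <year>", optionally followed by "à <hour>:<minute>".
_FRENCH_DATE = re.compile(
    r"^\s*(\d{1,2})\s+(\w+)\s+(\d{4})(?:\s+à\s+(\d{1,2}):(\d{2}))?\s*$"
)


def parse_french_date(text):
    """Parse one French date string in the fixed layout described above."""
    # Normalize so accented letters are single code points before matching.
    text = unicodedata.normalize("NFC", text)
    match = _FRENCH_DATE.match(text)
    if match is None:
        raise ValueError(f"Unsupported date layout: {text!r}")
    day, month_name, year, hour, minute = match.groups()
    month = FRENCH_MONTHS.get(month_name.lower())
    if month is None:
        raise ValueError(f"Unknown month name: {month_name!r}")
    # Hour and minute default to 0 when the "à HH:MM" part is absent.
    return datetime(int(year), month, int(day), int(hour or 0), int(minute or 0))
```

Anything beyond this narrow case (other languages, abbreviations, time zones) is where dateparser is the better fit.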

Moving on to parsing this test_dates.csv file:

               Date  Value
0    7 janvier 1983     10
1  21 décembre 1986     21
2    1 janvier 2016     12

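You can pass dateparser.parse directly as the date parser (the date_parser argument is deprecated since pandas 2.0, so this form is meant for older versions):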

In [2]: df = pd.read_csv('test_dates.csv',
                         parse_dates=['Date'], date_parser=dateparser.parse)
        print(df)

Out[2]:
        Date  Value
0 1983-01-07     10
1 1986-12-21     21
2 2016-01-01     12
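
On pandas 2.0 and later, where date_parser is deprecated, one possible alternative is to read the column as plain text and convert it afterwards. The sketch below assumes that approach; read_csv_with_date_parser is a hypothetical helper, and the parser is passed in as an argument so that dateparser.parse (or any other callable returning a datetime) can be used:

```python
import io

import pandas as pd


def read_csv_with_date_parser(source, column, parser):
    """Read a CSV, then convert one text column with a per-value parser.

    `source` is anything pd.read_csv accepts (a path or a file-like object),
    `column` is the name of the date column, and `parser` is a callable that
    takes a string and returns a datetime (or None when parsing fails).
    """
    # Keep the date column as strings so pandas does not try to parse it.
    df = pd.read_csv(source, dtype={column: str})
    # Apply the parser value by value, then let pandas build a datetime column.
    df[column] = pd.to_datetime(df[column].map(parser))
    return df


# Possible usage with the file from this answer (requires dateparser):
#   import dateparser
#   df = read_csv_with_date_parser('test_dates.csv', 'Date', dateparser.parse)
```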

If the dataframe is already loaded, you can convert the column afterwards with apply or map:

# Using apply (6.22 ms per loop)
df.Date = df.Date.apply(lambda x: dateparser.parse(x))

# Or map which is slightly slower (7.75 ms per loop)
df.Date = df.Date.map(dateparser.parse)
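
Since both variants call the parser once per row, a column with many repeated date strings may benefit from parsing each distinct string only once. The following is a minimal sketch of that idea, not something measured here; parse_with_cache is a hypothetical helper and the parser is passed in as an argument:

```python
from datetime import datetime
from typing import Callable, Dict, Iterable, List, Optional


def parse_with_cache(
    values: Iterable[str],
    parser: Callable[[str], Optional[datetime]],
) -> List[Optional[datetime]]:
    """Parse each distinct string once and reuse the result for repeats."""
    cache: Dict[str, Optional[datetime]] = {}
    parsed: List[Optional[datetime]] = []
    for value in values:
        if value not in cache:
            # First time this exact string is seen: call the (slow) parser.
            cache[value] = parser(value)
        parsed.append(cache[value])
    return parsed
```

With the dataframe above this could be called as df['Date'] = parse_with_cache(df['Date'], dateparser.parse). Whether it helps depends on how many repeated values the column holds.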
Julien Marrec answered Sep 30 '22 21:09