
Making Pandas work with Pendulum

I recently stumbled upon pendulum, an awesome new library that makes working with datetimes easier.

In pandas, there is the handy to_datetime() function, which converts Series and other objects to datetimes:

raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')

What would be the canonical way to create a custom to_<something> method, in this case a to_pendulum() method that converts a Series of date strings directly to Pendulum objects?

This could give a Series some interesting capabilities, for instance converting a series of date strings to a series of "offsets from now" (human-readable datetime diffs).

asked Dec 16 '17 at 19:12 by alecxe


1 Answer

What would be the canonical way to create a custom to_<something> method - in this case to_pendulum() method which would be able to convert Series of date strings directly to Pendulum objects?

After looking through the Pendulum API a bit, I must say I'm impressed with what they've done. Unfortunately, I don't think Pendulum and pandas can work together (at least, not with the current latest pandas version, v0.21).

The most important reason is that pandas does not natively support Pendulum as a datatype. The natively supported datatypes (integers, floats and np.datetime64) all support vectorisation in some form. Pendulum objects do not, so you are not going to get a shred of performance improvement from storing them in a dataframe over, say, a vanilla loop and list. If anything, calling apply on a Series of Pendulum objects is going to be slower (because of all the API overhead).
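As a rough illustration of that difference, the sketch below contrasts a native datetime64 column with an object column holding plain Python datetime objects. The plain datetime objects are only a stand-in for Pendulum objects (an assumption made so the snippet needs nothing beyond pandas), and the variable names are illustrative.

```python
import pandas as pd

raw = pd.Series(["2017-11-09 18:43:45", "2017-11-10 00:09:40"])

# Native storage: one contiguous datetime64 array that pandas can vectorise over.
native = pd.to_datetime(raw)

# Object storage: each element is a separate Python object (stand-in for Pendulum).
boxed = pd.Series([ts.to_pydatetime() for ts in native], dtype=object)

# Vectorised attribute access through the .dt accessor.
hours_native = native.dt.hour

# Python-level loop: the lambda is called once per element.
hours_boxed = boxed.apply(lambda d: d.hour)

print(native.dtype, boxed.dtype)
print(list(hours_native), list(hours_boxed))
```

Both routes give the same hours, but only the first one runs over a native array; the second pays a Python function call for every row.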

Another reason is that Pendulum is a subclass of datetime:

from datetime import datetime
import pendulum

isinstance(pendulum.now(), datetime)
True

This matters because, as mentioned above, datetime is a supported datatype, so pandas will attempt to coerce datetime objects (including instances of subclasses) to its native datetime type, Timestamp. Here's an example.

print(s)

0     2017-11-09 18:43:45
1     2017-11-09 20:15:27
2     2017-11-09 22:29:00
3     2017-11-09 23:42:34
4     2017-11-10 00:09:40
5     2017-11-10 00:23:14
6     2017-11-10 03:32:17
7     2017-11-10 10:59:24
8     2017-11-10 11:12:59
9     2017-11-10 13:49:09

s = s.apply(pendulum.parse)
s

0    2017-11-09 18:43:45+00:00
1    2017-11-09 20:15:27+00:00
2    2017-11-09 22:29:00+00:00
3    2017-11-09 23:42:34+00:00
4    2017-11-10 00:09:40+00:00
5    2017-11-10 00:23:14+00:00
6    2017-11-10 03:32:17+00:00
7    2017-11-10 10:59:24+00:00
8    2017-11-10 11:12:59+00:00
9    2017-11-10 13:49:09+00:00
Name: timestamp, dtype: datetime64[ns, <TimezoneInfo [UTC, GMT, +00:00:00, STD]>]

s[0]
Timestamp('2017-11-09 18:43:45+0000', tz='<TimezoneInfo [UTC, GMT, +00:00:00, STD]>')

type(s[0])
pandas._libs.tslib.Timestamp
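The same coercion can be reproduced without Pendulum installed. The sketch below uses StandInDateTime, a hypothetical datetime subclass invented purely for illustration, and assumes that pandas treats any datetime subclass the way it treats Pendulum objects here.

```python
from datetime import datetime

import pandas as pd


class StandInDateTime(datetime):
    """Hypothetical datetime subclass standing in for a Pendulum object."""


values = [
    StandInDateTime(2017, 11, 9, 18, 43, 45),
    StandInDateTime(2017, 11, 10, 0, 9, 40),
]

# No dtype given, so pandas infers one and converts the values.
inferred = pd.Series(values)
first = inferred.iloc[0]

print(inferred.dtype)        # a datetime64 dtype, not object
print(type(first).__name__)  # the subclass is gone; pandas hands back its own type
```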

So, with some difficulty (involving dtype=object), you could load Pendulum objects into dataframes. Here's how you'd do that:

v = np.vectorize(pendulum.parse)
s = pd.Series(v(s), dtype=object)

s

0     2017-11-09T18:43:45+00:00
1     2017-11-09T20:15:27+00:00
2     2017-11-09T22:29:00+00:00
3     2017-11-09T23:42:34+00:00
4     2017-11-10T00:09:40+00:00
5     2017-11-10T00:23:14+00:00
6     2017-11-10T03:32:17+00:00
7     2017-11-10T10:59:24+00:00
8     2017-11-10T11:12:59+00:00
9     2017-11-10T13:49:09+00:00

s[0]
<Pendulum [2017-11-09T18:43:45+00:00]>
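If you wanted to wrap this in something resembling the to_pendulum() helper from the question, a minimal sketch could look like the following. The name to_object_series and its parser argument are hypothetical, not part of pandas or Pendulum; passing pendulum.parse as the parser would correspond to the approach above. To keep the snippet runnable without Pendulum, it is demonstrated with a hypothetical datetime subclass instead.

```python
from datetime import datetime

import pandas as pd


def to_object_series(series, parser):
    """Parse each element with `parser` and keep the results as Python objects.

    Hypothetical helper: dtype=object stops pandas from converting the
    parsed values to its own Timestamp type.
    """
    parsed = [parser(value) for value in series]
    return pd.Series(parsed, index=series.index, dtype=object, name=series.name)


class StandInDateTime(datetime):
    """Hypothetical datetime subclass standing in for a Pendulum object."""


def parse_stand_in(text):
    # strptime is a classmethod, so it returns a StandInDateTime instance.
    return StandInDateTime.strptime(text, "%Y-%m-%d %H:%M:%S")


raw = pd.Series(["2017-11-09 18:43:45", "2017-11-10 00:09:40"], name="timestamp")
boxed = to_object_series(raw, parse_stand_in)

print(boxed.dtype)
print(type(boxed.iloc[0]).__name__)
```

This is a plain function rather than a Series method, and it still loops in Python, so it inherits the performance caveats described earlier.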

However, this is essentially useless: calling any Pendulum method (via apply) will not only be very slow, but any datetime it returns will end up coerced to Timestamp again, which makes the whole thing an exercise in futility.

answered Oct 28 '22 at 14:10 by cs95