I seem to be running into unexpectedly slow performance of arithmetic operations on pandas.Timestamp objects compared with Python's regular datetime objects.
Here is a benchmark that demonstrates it:
import datetime
import pandas
import numpy
# using datetime:
def test1():
    d1 = datetime.datetime(2015, 3, 20, 10, 0, 0)
    d2 = datetime.datetime(2015, 3, 20, 10, 0, 15)
    delta = datetime.timedelta(minutes=30)
    count = 0
    for i in range(500000):
        if d2 - d1 > delta:
            count += 1
# using pandas:
def test2():
    d1 = pandas.Timestamp('2015-03-20 10:00:00')
    d2 = pandas.Timestamp('2015-03-20 10:00:15')
    delta = pandas.Timedelta(minutes=30)
    count = 0
    for i in range(500000):
        if d2 - d1 > delta:
            count += 1
# using numpy:
def test3():
    d1 = numpy.datetime64('2015-03-20 10:00:00')
    d2 = numpy.datetime64('2015-03-20 10:00:15')
    delta = numpy.timedelta64(30, 'm')
    count = 0
    for i in range(500000):
        if d2 - d1 > delta:
            count += 1
time1 = datetime.datetime.now()
test1()
time2 = datetime.datetime.now()
test2()
time3 = datetime.datetime.now()
test3()
time4 = datetime.datetime.now()
print('DELTA test1: ' + str(time2-time1))
print('DELTA test2: ' + str(time3-time2))
print('DELTA test3: ' + str(time4-time3))
And the corresponding results on my machine (Python 3.3, pandas 0.15.2):
DELTA test1: 0:00:00.131698
DELTA test2: 0:00:10.034970
DELTA test3: 0:00:05.233389
Is this expected?
Are there ways to eliminate the performance problem other than switching code to Python's default datetime implementation as much as possible?
Timestamp is the pandas equivalent of Python's datetime and is interchangeable with it in most cases. It is the type used for the entries of a DatetimeIndex and of the other time-series-oriented data structures in pandas. Under the hood, pandas builds its time-series functionality on the NumPy datetime64 and timedelta64 dtypes, consolidating features from other Python libraries such as scikits.timeseries.
Comparisons between pandas Timestamp objects use the ordinary operators (>, <, ==, <=, >=), and the difference between two Timestamps is computed with the '-' operator, which yields a Timedelta. A given date/time can be converted to a Timestamp with the pandas.Timestamp() constructor, and pandas.to_datetime() converts date/time strings (for example, a string-typed column in a DataFrame) to the datetime64 dtype.
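A minimal sketch of those operations (the values and the 'date' column below are made up purely for illustration):
import pandas as pd
# Construct Timestamps, then compare and subtract them with ordinary operators.
t1 = pd.Timestamp('2015-03-20 10:00:00')
t2 = pd.Timestamp('2015-03-20 10:00:15')
print(t2 > t1)    # True
print(t2 - t1)    # Timedelta('0 days 00:00:15')
# to_datetime() converts a string column to the datetime64[ns] dtype.
df = pd.DataFrame({'date': ['2015-03-20 10:00:00', '2015-03-20 10:00:15']})
df['date'] = pd.to_datetime(df['date'])
print(df['date'].dtype)  # datetime64[ns]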
I've got similar results on my machine:
$ python -mtimeit -s "from datetime import datetime, timedelta; d1, d2 = datetime(2015, 3, 20, 10, 0, 0), datetime(2015, 3, 20, 10, 0, 15); delta = timedelta(minutes=30)" "(d2 - d1) > delta"
10000000 loops, best of 3: 0.107 usec per loop
$ python -mtimeit -s "from numpy import datetime64, timedelta64; d1, d2 = datetime64('2015-03-20T10:00:00Z'), datetime64('2015-03-20T10:00:15Z'); delta = timedelta64(30, 'm')" "(d2 - d1) > delta"
100000 loops, best of 3: 5.35 usec per loop
$ python -mtimeit -s "from pandas import Timestamp, Timedelta; d1, d2 = Timestamp('2015-03-20T10:00:00Z'), Timestamp('2015-03-20T10:00:15Z'); delta = Timedelta(minutes=30)" "(d2 - d1) > delta"
10000 loops, best of 3: 19.9 usec per loop
datetime is orders of magnitude faster than the corresponding numpy and pandas analogs: roughly 50 times faster than numpy.datetime64 and almost 200 times faster than pandas.Timestamp here.
$ python -c "import numpy, pandas; print(numpy.__version__, pandas.__version__)"
('1.9.2', '0.15.2')
It is not clear why the difference is so large. It is true that numpy and pandas code is optimized for vectorized operations (see the sketch after the next timing), but it is not obvious why these particular scalar operations should be two orders of magnitude slower. For comparison, adding an explicit timezone does not slow down the datetime.datetime code:
$ python3 -mtimeit -s "from datetime import datetime, timedelta, timezone; d1, d2 = datetime(2015, 3, 20, 10, 0, 0, tzinfo=timezone.utc), datetime(2015, 3, 20, 10, 0, 15, tzinfo=timezone.utc); delta = timedelta(minutes=30)" "(d2 - d1) > delta"
10000000 loops, best of 3: 0.0939 usec per loop
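For reference, here is a rough sketch of the vectorized path that numpy and pandas are optimized for: a single comparison over an array of 500,000 timestamps instead of 500,000 scalar operations (the array contents are made up for illustration):
import numpy as np
import pandas as pd
# Build 500,000 timestamps one second apart and do one vectorized comparison.
idx = pd.date_range('2015-03-20 10:00:00', periods=500000, freq='S')
count = int(((idx - idx[0]) > pd.Timedelta(minutes=30)).sum())
# The same thing with plain numpy datetime64/timedelta64 arrays.
arr = idx.values  # datetime64[ns]
count_np = int(((arr - arr[0]) > np.timedelta64(30, 'm')).sum())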
To work around the issue, you could convert the native date/time objects en masse into simpler (faster) representations, e.g., POSIX timestamps represented as floats, if you can't use vectorized operations.
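For example, a minimal sketch of that workaround applied to the loop from the question, assuming naive datetimes that can be interpreted via .timestamp() (Python 3.3+):
import datetime
d1 = datetime.datetime(2015, 3, 20, 10, 0, 0)
d2 = datetime.datetime(2015, 3, 20, 10, 0, 15)
# Convert once, up front, to POSIX timestamps (plain floats).
t1, t2 = d1.timestamp(), d2.timestamp()
delta = 30 * 60.0  # 30 minutes, in seconds
count = 0
for i in range(500000):
    if t2 - t1 > delta:  # float arithmetic in the hot loop
        count += 1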