Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Slow performance of pandas timestamp vs datetime

I seem to be running into unexpectedly slow performance of arithmetic operations on pandas.Timestamp vs python regular datetime() objects.
Here is a benchmark that demonstrates:

import datetime
import pandas
import numpy

# using datetime:
def test1():
    d1 = datetime.datetime(2015, 3, 20, 10, 0, 0)
    d2 = datetime.datetime(2015, 3, 20, 10, 0, 15)
    delta = datetime.timedelta(minutes=30)

    count = 0
    for i in range(500000):
        if d2 - d1 > delta:
            count += 1

# using pandas:
def test2():
    d1 = pandas.datetime(2015, 3, 20, 10, 0, 0)
    d2 = pandas.datetime(2015, 3, 20, 10, 0, 15)
    delta = pandas.Timedelta(minutes=30)

    count = 0
    for i in range(500000):
        if d2 - d1 > delta:
            count += 1

# using numpy
def test3():
    d1 = numpy.datetime64('2015-03-20 10:00:00')
    d2 = numpy.datetime64('2015-03-20 10:00:15')
    delta = numpy.timedelta64(30, 'm')

    count = 0
    for i in range(500000):
        if d2 - d1 > delta:
            count += 1


  time1 = datetime.datetime.now()
  test1()
  time2 = datetime.datetime.now()
  test2()
  time3 = datetime.datetime.now()
  test3()
  time4 = datetime.datetime.now()

  print('DELTA test1: ' + str(time2-time1))
  print('DELTA test2: ' + str(time3-time2))
  print('DELTA test3: ' + str(time4-time3))

And corresponding results on my machine (python3.3, pandas 0.15.2):

DELTA test1: 0:00:00.131698
DELTA test2: 0:00:10.034970
DELTA test3: 0:00:05.233389

Is this expected?
Are there ways to eliminate the performance problem other than switching code to Python's default datetime implementation as much as possible?

like image 538
Gregory Shklover Avatar asked Mar 21 '15 19:03

Gregory Shklover


People also ask

What is the difference between Timestamp and datetime in pandas?

Timestamp is the pandas equivalent of python's Datetime and is interchangeable with it in most cases. It's the type used for the entries that make up a DatetimeIndex, and other timeseries oriented data structures in pandas. Value to be converted to Timestamp.

Is pandas good for time series?

pandas contains extensive capabilities and features for working with time series data for all domains. Using the NumPy datetime64 and timedelta64 dtypes, pandas has consolidated a large number of features from other Python libraries like scikits.

How do I compare Panda timestamps?

Comparison between pandas timestamp objects is carried out using simple comparison operators: >, <,==,< = , >=. The difference can be calculated using a simple '–' operator. Given time can be converted to pandas timestamp using pandas. Timestamp() method.

What does datetime do in pandas?

Pandas has a built-in function called to_datetime()that converts date and time in string format to a DateTime object. As you can see, the 'date' column in the DataFrame is currently of a string-type object. Thus, to_datetime() converts the column to a series of the appropriate datetime64 dtype.


1 Answers

I've got similar results on my machine:

$ python -mtimeit -s "from datetime import datetime, timedelta; d1, d2 = datetime(2015, 3, 20, 10, 0, 0), datetime(2015, 3, 20, 10, 0, 15); delta = timedelta(minutes=30)" "(d2 - d1) > delta"
10000000 loops, best of 3: 0.107 usec per loop
$ python -mtimeit -s "from numpy import datetime64, timedelta64; d1, d2 = datetime64('2015-03-20T10:00:00Z'), datetime64('2015-03-20T10:00:15Z'); delta = timedelta64(30, 'm')" "(d2 - d1) > delta"
100000 loops, best of 3: 5.35 usec per loop
$ python -mtimeit -s "from pandas import Timestamp, Timedelta; d1, d2 = Timestamp('2015-03-20T10:00:00Z'), Timestamp('2015-03-20T10:00:15Z'); delta = Timedelta(minutes=30)" "(d2 - d1) > delta"
10000 loops, best of 3: 19.9 usec per loop

datetime is several times faster than corresponding numpy, pandas analogs.

$ python -c "import numpy, pandas; print(numpy.__version__, pandas.__version__)"
('1.9.2', '0.15.2')

It is not clear why the difference is so large. It is true that numpy, pandas code is optimized for vectorized operations. But it is not obvious why these particular scalar operations are two orders of magnitude slower e.g., the adding of the explicit timezone does not slowdown datetime.datetime code:

$ python3 -mtimeit -s "from datetime import datetime, timedelta, timezone; d1, d2 = datetime(2015, 3, 20, 10, 0, 0, tzinfo=timezone.utc), datetime(2015, 3, 20, 10, 0, 15, tzinfo=timezone.utc); delta = timedelta(minutes=30)" "(d2 - d1) > delta"
10000000 loops, best of 3: 0.0939 usec per loop

To workaround the issue, you could try to convert native date/time types en masse into simpler (faster) analogs (e.g., POSIX timestamp represented as a float) if you can't use vectorized operations.

like image 122
jfs Avatar answered Nov 06 '22 18:11

jfs