This question is motivated by an answer to a question on improving performance when performing comparisons with DatetimeIndex in pandas.
The solution converts the DatetimeIndex to a numpy array via df.index.values and compares the array to a np.datetime64 object. This appears to be the most efficient way to retrieve the Boolean array from this comparison.
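As a rough illustration of the approach described above, the sketch below builds a small frame and compares the raw numpy values against np.datetime64 bounds. The frame, its column name and the date bounds are made up for the example and are not taken from the linked answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data: seven timestamps spaced 12 hours apart.
index = pd.date_range("2011-01-01", "2011-01-04", periods=7)
df = pd.DataFrame({"value": range(7)}, index=index)

# numpy route: drop down to the underlying datetime64 array.
values = df.index.values
start = np.datetime64("2011-01-02")
stop = np.datetime64("2011-01-03")
mask_numpy = (values >= start) & (values < stop)

# pandas route: compare the DatetimeIndex against Timestamp scalars.
mask_pandas = (df.index >= pd.Timestamp("2011-01-02")) & (
    df.index < pd.Timestamp("2011-01-03")
)

print(mask_numpy)
print(np.array_equal(mask_numpy, mask_pandas))
```

For this simple, timezone-naive case both routes select the same rows; the example says nothing about cases such as timezone-aware indexes.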
The feedback on this question from one of the developers of pandas was: "These are not the same generally. Offering up a numpy solution is often a special case and not recommended."
My questions are:
DatetimeIndex offers more functionality, but I require only basic functionality such as slicing and indexing.
numpy?
In my research, I found some posts which mention "not always compatible", but none of them seem to have any conclusive references or documentation, or specify why or when they are generally incompatible. Many other posts use the numpy representation without comment.
datetime64() format. To convert it to a datetime object, use the astype() method and pass datetime as the argument.
Starting in NumPy 1.7, there are core array data types which natively support datetime functionality. The data type is called datetime64, so named because datetime is already taken by the Python standard library.
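A small sketch of the astype() conversion mentioned above. The sample values are arbitrary; the point worth noting is that the result type depends on the unit of the datetime64 value, so the conversion does not always yield a datetime.datetime.

```python
from datetime import date, datetime

import numpy as np

# Second resolution converts to datetime.datetime.
as_datetime = np.datetime64("2011-01-02T03:04:05", "s").astype(datetime)

# Day resolution converts to datetime.date.
as_date = np.datetime64("2011-01-02", "D").astype(datetime)

# Nanosecond resolution cannot be represented by datetime.datetime,
# so numpy falls back to a plain integer (nanoseconds since the epoch).
as_integer = np.datetime64("2011-01-02T03:04:05", "ns").astype(datetime)

print(type(as_datetime), type(as_date), type(as_integer))
```

Because pandas has traditionally stored values at nanosecond resolution, calling astype(datetime) on values taken from a DatetimeIndex may return integers rather than datetime objects.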
In my opinion, you should always prefer using a Timestamp: it can easily be converted back into a numpy datetime if that is ever needed.
numpy.datetime64 is essentially a thin wrapper for int64. It has almost no date/time specific functionality.
pd.Timestamp is a wrapper around a numpy.datetime64. It is backed by the same int64 value, but supports the entire datetime.datetime interface, along with useful pandas-specific functionality.
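The difference between the two scalar types can be seen in a short sketch like the one below. The sample dates are arbitrary, and the checks only illustrate the points made above.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# A datetime64 scalar is an integer count of units since the epoch.
one_second = np.datetime64("1970-01-01T00:00:01", "s")
as_int = int(one_second.astype("int64"))

# A Timestamp is a datetime.datetime subclass with extra pandas features.
ts = pd.Timestamp("2011-01-02 03:04:05")
is_datetime = isinstance(ts, datetime)

# Converting to numpy and back gives an equal Timestamp.
dt64 = ts.to_datetime64()
round_trip = pd.Timestamp(dt64)

# The numpy scalar has no calendar attributes such as .year.
has_year = hasattr(dt64, "year")

print(as_int, is_datetime, round_trip == ts, has_year, ts.year)
```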
The in-array representation of these two is identical: it is a contiguous array of int64s. pd.Timestamp is a scalar box that makes working with individual values easier.
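To make the shared storage concrete, here is a minimal sketch using a made-up three-day index. It reinterprets the numpy array as int64 and compares that with the integers pandas exposes through asi8, then shows how the same element is boxed differently depending on where it is read from.

```python
import numpy as np
import pandas as pd

# Hypothetical index, for illustration only.
idx = pd.date_range("2011-01-01", periods=3, freq="D")

# The numpy view of the index is a datetime64 array...
raw = idx.values

# ...which can be reinterpreted as plain int64 without copying.
ints = raw.view("int64")

# pandas exposes the same integers through the asi8 attribute.
same_ints = np.array_equal(ints, idx.asi8)

# Element access boxes the value differently on each side.
pandas_scalar = idx[0]   # pd.Timestamp
numpy_scalar = raw[0]    # np.datetime64

print(raw.dtype, same_ints, type(pandas_scalar), type(numpy_scalar))
```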
Going back to the linked answer, you could write it like this, which is shorter and happens to be faster.
%timeit (df.index.values >= pd.Timestamp('2011-01-02').to_datetime64()) & \
(df.index.values < pd.Timestamp('2011-01-03').to_datetime64())
192 µs ± 6.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)