pd.Timestamp versus np.datetime64: are they interchangeable for selected uses?

Tags: python, arrays, datetime, pandas, numpy

This question is motivated by an answer to a question on improving performance when performing comparisons with DatetimeIndex in pandas.

The solution converts the DatetimeIndex to a numpy array via df.index.values and compares the array to a np.datetime64 object. This appears to be the most efficient way to retrieve the Boolean array from this comparison.

The feedback on this question from one of the developers of pandas was: "These are not the same generally. Offering up a numpy solution is often a special case and not recommended."

My questions are:

1. Are they interchangeable for a subset of operations? I appreciate DatetimeIndex offers more functionality, but I require only basic functionality such as slicing and indexing.
2. Are there any documented differences in result for operations that are translatable to numpy?

In my research, I found some posts which mention "not always compatible" - but none of them seem to have any conclusive references / documentation, or specify why/when generally they are incompatible. Many other posts use the numpy representation without comment.

Pandas DatetimeIndex indexing dtype: datetime64 vs Timestamp
How to convert from pandas.DatetimeIndex to numpy.datetime64?

634

asked Apr 10 '18 15:04

jpp

1 Answer

In my opinion, you should always prefer a Timestamp: it can easily be converted back into a numpy datetime when that is needed.

numpy.datetime64 is essentially a thin wrapper for int64. It has almost no date/time specific functionality.

pd.Timestamp is a wrapper around a numpy.datetime64. It is backed by the same int64 value, but supports the entire datetime.datetime interface, along with useful pandas-specific functionality.
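As a rough illustration of the scalar relationship described above, here is a minimal sketch. The variable names are only illustrative, and it assumes a reasonably recent pandas and numpy.

```python
import datetime

import numpy as np
import pandas as pd

# Illustrative scalar; any valid date string would do.
ts = pd.Timestamp("2011-01-02 03:04:05")

# pandas scalar -> numpy scalar, and back again.
np_dt = ts.to_datetime64()
round_trip = pd.Timestamp(np_dt)

# Timestamp is usable wherever a datetime.datetime is expected.
is_datetime = isinstance(ts, datetime.datetime)

# Date-specific helpers live on the pandas scalar, not the numpy one.
weekday_name = ts.day_name()
has_year_attr = hasattr(np_dt, "year")

print(type(np_dt).__name__, is_datetime, weekday_name, has_year_attr)
```

The round trip preserves the instant, which is why converting to numpy only at the point of comparison costs nothing in correctness.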

The in-array representation of these two is identical - it is a contiguous array of int64s. pd.Timestamp is a scalar box that makes working with individual values easier.
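The shared storage can be checked directly. The sketch below builds a small, made-up DatetimeIndex and compares its numpy view with the integer values pandas exposes through `asi8`; it is an illustration, not a statement about pandas internals beyond what these public attributes show.

```python
import numpy as np
import pandas as pd

# Small illustrative index; the dates are arbitrary.
idx = pd.date_range("2011-01-01", periods=3, freq="D")

raw = idx.values              # numpy datetime64 array
as_ints = raw.view("int64")   # reinterpret the same 8-byte values as int64

boxed = idx[0]      # element access on the index gives a pd.Timestamp
unboxed = raw[0]    # element access on the array gives a np.datetime64

print(raw.dtype, as_ints.dtype, type(boxed).__name__, type(unboxed).__name__)
```

Only the scalar you get back on element access differs; the array data is the same set of integers either way.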

Going back to the linked answer, you could write it like this, which is shorter and happens to be faster.

%timeit (df.index.values >= pd.Timestamp('2011-01-02').to_datetime64()) & \
        (df.index.values < pd.Timestamp('2011-01-03').to_datetime64())
192 µs ± 6.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
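For readers without the original `df`, the following self-contained sketch uses a small hypothetical frame (the real data from the linked answer is not shown here) to check that the numpy-level comparison gives the same Boolean mask as comparing the DatetimeIndex directly. It makes no claim about timings.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `df`: nine timestamps spaced 12 hours apart.
idx = pd.date_range("2011-01-01", "2011-01-05", periods=9)
df = pd.DataFrame({"value": np.arange(len(idx))}, index=idx)

start = pd.Timestamp("2011-01-02")
stop = pd.Timestamp("2011-01-03")

# Comparison on the pandas index with Timestamp scalars.
mask_index = (df.index >= start) & (df.index < stop)

# Same comparison on the underlying numpy array with datetime64 scalars.
values = df.index.values
mask_numpy = (values >= start.to_datetime64()) & (values < stop.to_datetime64())

print(np.array_equal(mask_index, mask_numpy), int(mask_numpy.sum()))
```

For plain, timezone-naive comparisons like this one the two masks agree; the sketch does not cover timezone-aware indexes or other cases where the pandas and numpy types may behave differently.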

187

answered Oct 16 '22 18:10

chrisb
