Storing pure python datetime.datetime in pandas DataFrame

Since matplotlib doesn't support either pandas.Timestamp or numpy.datetime64, and there are no simple workarounds, I decided to convert a native pandas date column into pure Python datetime.datetime objects so that scatter plots are easier to make.

However:

t = pd.DataFrame({'date': [pd.to_datetime('2012-12-31')]})
t.dtypes # date    datetime64[ns], as expected
pure_python_datetime_array = t.date.dt.to_pydatetime() # works fine
t['date'] = pure_python_datetime_array # doesn't do what I hoped
t.dtypes # date    datetime64[ns] as before, no luck changing it

I'm guessing pandas auto-converts the pure python datetime produced by to_pydatetime into its native format. I guess it's convenient behavior in general, but is there a way to override it?

max asked Sep 01 '16 17:09



2 Answers

The use of to_pydatetime() is correct.

In [87]: t = pd.DataFrame({'date': [pd.to_datetime('2012-12-31'), pd.to_datetime('2013-12-31')]})

In [88]: t.date.dt.to_pydatetime()
Out[88]: 
array([datetime.datetime(2012, 12, 31, 0, 0),
       datetime.datetime(2013, 12, 31, 0, 0)], dtype=object)

When you assign the result back to t.date, pandas automatically converts it back to datetime64.

pandas.Timestamp is a datetime subclass anyway :)
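The subclass relationship can be checked directly. This is a minimal sketch with illustrative variable names, not part of the original answer: because pd.Timestamp inherits from datetime.datetime, code that only needs datetime behaviour can usually take a Timestamp as it is.

```python
import datetime

import pandas as pd

ts = pd.Timestamp('2012-12-31')

# Timestamp inherits from datetime.datetime, so isinstance checks pass
is_datetime = isinstance(ts, datetime.datetime)
print(is_datetime)

# to_pydatetime() returns a plain datetime.datetime when the exact type matters
plain = ts.to_pydatetime()
print(type(plain) is datetime.datetime)
```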

One way to do the plot is to convert the datetime to int64:

In [117]: t = pd.DataFrame({'date': [pd.to_datetime('2012-12-31'), pd.to_datetime('2013-12-31')], 'sample_data': [1, 2]})

In [118]: t['date_int'] = t.date.astype(np.int64)

In [119]: t
Out[119]: 
        date  sample_data             date_int
0 2012-12-31            1  1356912000000000000
1 2013-12-31            2  1388448000000000000

In [120]: t.plot(kind='scatter', x='date_int', y='sample_data')
Out[120]: <matplotlib.axes._subplots.AxesSubplot at 0x7f3c852662d0>

In [121]: plt.show()
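The integers in date_int are nanoseconds since the Unix epoch, so they can be turned back into timestamps when readable values are needed, for example for tick labels. The sketch below is an illustration rather than part of the original answer; it pins the resolution to datetime64[ns] explicitly, on the assumption that the default resolution may differ between pandas versions.

```python
import pandas as pd

# Pin nanosecond resolution so the integers are nanoseconds since the epoch
dates = pd.Series(pd.to_datetime(['2012-12-31', '2013-12-31'])).astype('datetime64[ns]')
date_int = dates.astype('int64')
print(date_int.tolist())

# Convert the integers back to timestamps, e.g. for labelling axis ticks
restored = pd.to_datetime(date_int, unit='ns')
print(restored.tolist())
```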


Another workaround is to skip scatter and use the regular plot with a point marker style:

In [126]: t.plot(x='date', y='sample_data', style='.')
Out[126]: <matplotlib.axes._subplots.AxesSubplot at 0x7f3c850f5750>

And the last workaround is to pass the converted array straight to matplotlib:

In [141]: import matplotlib.pyplot as plt

In [142]: t = pd.DataFrame({'date': [pd.to_datetime('2012-12-31'), pd.to_datetime('2013-12-31')], 'sample_data': [100, 20000]})

In [143]: t
Out[143]: 
        date  sample_data
0 2012-12-31          100
1 2013-12-31        20000
In [144]: plt.scatter(t.date.dt.to_pydatetime()  , t.sample_data)
Out[144]: <matplotlib.collections.PathCollection at 0x7f3c84a10510>

In [145]: plt.show()


There is a GitHub issue about this, which was still open at the time of writing.

Nehal J Wani answered Oct 02 '22 13:10


Here is a possible solution with the Series class from pandas:

t = pd.DataFrame({'date': [pd.to_datetime('2012-12-31')]})
t.dtypes # date    datetime64[ns], as expected
pure_python_datetime_array = t.date.dt.to_pydatetime() # works fine
t['date'] = pd.Series(pure_python_datetime_array, dtype=object) # should do what you expect
t.dtypes # date    object; the elements are now datetime.datetime
type(t.values[0, 0]) # datetime.datetime, so the datetime object is directly accessible

Why does this work? My assumption is that passing dtype=object forces the date column to hold generic Python objects, so pandas does not perform any internal conversion from datetime.datetime to datetime64.

Please correct me if I am wrong.
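One way to test that assumption is to compare a plain assignment with an explicit object-dtype Series. This is a rough sketch, not the answer's own code: the names py_dates, inferred and kept are made up for illustration, and the datetime objects are built element by element with Timestamp.to_pydatetime(), on the assumption that this behaves the same across pandas versions.

```python
import datetime

import pandas as pd

t = pd.DataFrame({'date': [pd.to_datetime('2012-12-31')]})

# Build plain datetime.datetime objects one Timestamp at a time
py_dates = [ts.to_pydatetime() for ts in t['date']]

# Assigning a plain list lets pandas infer a datetime64 dtype again
inferred = t.copy()
inferred['date'] = py_dates
print(inferred['date'].dtype)

# An explicit object-dtype Series keeps the elements as datetime.datetime
kept = t.copy()
kept['date'] = pd.Series(py_dates, dtype=object, index=t.index)
print(kept['date'].dtype, type(kept['date'].iloc[0]))
```

Keep in mind that an object column gives up the vectorized .dt accessor, so this is probably best done on a copy used only for plotting.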

PiMathCLanguage answered Oct 02 '22 12:10