When trying to re-assign certain values in a column using df.loc[], I am getting an unexpected type conversion that turns my datetimes into integers.
Minimal Example:
import numpy as np
import pandas as pd
import datetime
d = pd.DataFrame(list(zip(['12/6/2015', np.nan], [1, 2])), columns=list('ab'))
print(d)
d.loc[pd.notnull(d.a), 'a'] = d.a[pd.notnull(d.a)].apply(lambda x: datetime.datetime(2015,12,6))
print(d)
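For reference, the two print calls show the conversion (the second frame matches the integer shown in the answer below):
           a  b
0  12/6/2015  1
1        NaN  2
                     a  b
0  1449360000000000000  1
1                  NaN  2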
Full Example:
Here is my dataframe (contains NaNs):
>>> df.head()
prior_ea_date quarter
0 12/31/2015 Q2
1 12/31/2015 Q3
2 12/31/2015 Q3
3 12/31/2015 Q3
4 12/31/2015 Q2
>>> df.prior_ea_date
0 12/31/2015
1 12/31/2015
...
341486 1/19/2016
341487 1/6/2016
Name: prior_ea_date, dtype: object
I want to run the following line of code:
df.loc[pd.notnull(df.prior_ea_date), 'prior_ea_date'] = df.prior_ea_date[pd.notnull(df.prior_ea_date)].apply(dt, usa=True)
where dt is a string-to-datetime parser, which when run normally gives:
>>> df.prior_ea_date[pd.notnull(df.prior_ea_date)].apply(dt, usa=True).head()
0 2015-12-31
1 2015-12-31
2 2015-12-31
3 2015-12-31
4 2015-12-31
Name: prior_ea_date, dtype: datetime64[ns]
However, when I run the .loc[] assignment I get the following:
>>> df.loc[pd.notnull(df.prior_ea_date), 'prior_ea_date'] = df.prior_ea_date[pd.notnull(df.prior_ea_date)].apply(dt, usa=True)
>>> df.head()
prior_ea_date quarter
0 1451520000000000000 Q2
1 1451520000000000000 Q3
2 1451520000000000000 Q3
3 1451520000000000000 Q3
4 1451520000000000000 Q2
and it has converted my datetime objects to integers.
I have managed to build a temporary workaround, so while any one-line hacks would be appreciated, I would really like a pandas-style solution.
Thanks.
We'll start with the second question: how to avoid this behavior?
My understanding is that you want to convert the prior_ea_date column to datetime objects. The Pandas-style approach is to use to_datetime:
df.prior_ea_date = pd.to_datetime(df.prior_ea_date, format='%m/%d/%Y')
df.prior_ea_date
0 2015-12-31
1 2015-12-31
2 2015-12-31
3 2015-12-31
4 2015-12-31
5 NaT
Name: prior_ea_date, dtype: datetime64[ns]
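A side note: if some of your strings might not match the format, to_datetime can coerce them to NaT instead of raising:
df.prior_ea_date = pd.to_datetime(df.prior_ea_date, format='%m/%d/%Y', errors='coerce')
# unparseable values and the original NaNs both end up as NaT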
Your first question is more interesting: why is this happening?
What I think is happening is that when you use df.loc[pd.notnull(df.prior_ea_date), 'prior_ea_date'] = ... you are setting values on a slice of the prior_ea_date column instead of overwriting the whole column. In this case, Pandas performs a tacit type cast to convert the right-hand side to the dtype of the original prior_ea_date column. Notice that those long integers are the epoch timestamps (nanoseconds since 1970-01-01 UTC) of the wanted dates.
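As a quick sanity check, feeding one of those integers back to pd.Timestamp recovers the date, because pandas interprets a bare integer as nanoseconds since the epoch:
pd.Timestamp(1451520000000000000)
#>>> Timestamp('2015-12-31 00:00:00')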
We can see this with your minimal example:
##
# Example of type casting on slice
##
d = pd.DataFrame(list(zip(['12/6/2015', np.nan], [1, 2])), columns=list('ab'))
# Column-a starts as dtype: object
d.a
0 12/6/2015
1 NaN
Name: a, dtype: object
d.loc[pd.notnull(d.a), 'a'] = d.a[pd.notnull(d.a)].apply(lambda x: datetime.datetime(2015,12,6))
# Column-a is still dtype: object
d.a
0 1449360000000000000
1 NaN
Name: a, dtype: object
##
# Example of overwriting whole column
##
d = pd.DataFrame(list(zip(['12/6/2015', np.nan], [1, 2])), columns=list('ab'))
d.a = pd.to_datetime(d.a, format='%m/%d/%Y')
# Column-a dtype is now datetime
d.a
0 2015-12-06
1 NaT
Name: a, dtype: datetime64[ns]
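To round this out, here is a sketch (not in the original examples) showing that once the column's dtype is already datetime64, setting on a loc slice preserves the datetimes, which is consistent with the idea that slice assignment casts to the column's existing dtype:
##
# Example of setting on a slice of an already-datetime column (sketch)
##
d = pd.DataFrame(list(zip(['12/6/2015', np.nan], [1, 2])), columns=list('ab'))
d.a = pd.to_datetime(d.a, format='%m/%d/%Y')
d.loc[pd.notnull(d.a), 'a'] = datetime.datetime(2016, 1, 1)
# Column-a keeps its datetime64 dtype; no integer cast this time
d.a
#>>> 0   2016-01-01
#>>> 1          NaT
#>>> Name: a, dtype: datetime64[ns]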
FURTHER DETAILS:
In response to the OP's request for more under-the-hood details, I traced the call stack in PyCharm to learn what is going on. The TLDR answer is: ultimately, the unexpected behavior of casting datetime dtypes into integers is due to Numpy's internal behavior.
d = np.datetime64('2015-12-30T16:00:00.000000000-0800')
d.astype(np.dtype(object))
#>>> 1451520000000000000L
...could you elaborate on why this type casting is happening when using .loc and how to avoid it...
The intuition in my original answer is correct: the datetime objects are being cast into generic object types. This happens because setting on the loc slice preserves the dtype of the column whose values are being set.
When setting values with loc, Pandas uses the _LocationIndexer in the indexing module. After a great deal of checking dimensions and conditions, the line self.obj._data = self.obj._data.setitem(indexer, value) actually sets the new values.
Stepping into that line, we find the moment the datetimes are cast into integers, at line 742 of pandas.core.internals.py:
values[indexer] = value
In this statement, values is a Numpy ndarray of object dtype; this is the data from the left-hand side of the original assignment, and it contains the date strings. The indexer is just a tuple. And value is an ndarray of Numpy datetime64 objects.
This operation uses Numpy's own setitem methods, which fill individual "cells" with calls to np.asarray(value, self.dtype). In your case, self.dtype is the dtype of the left-hand side, object, and the value parameters are the individual datetimes.
np.asarray(d, np.dtype(object))
#>>> array(1451520000000000000L, dtype=object)
...and how to avoid it...
Don't use loc. Overwrite the whole column, as in my example above.
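If you need a custom parser like your dt rather than to_datetime, here is a sketch of the same whole-column rewrite (parse_date below is a hypothetical stand-in for your dt, since its implementation wasn't shown):
def parse_date(s):
    # hypothetical stand-in for the OP's dt parser
    return pd.to_datetime(s, format='%m/%d/%Y')
parsed = df.prior_ea_date[pd.notnull(df.prior_ea_date)].apply(parse_date)
# overwrite the whole column; reindex restores the full index and the
# formerly-NaN rows come back as NaT
df.prior_ea_date = parsed.reindex(df.index)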
...I thought having the column with dtype=object would avoid pandas assuming the object type. And either way it seems unexpected to me why it should be converting it to an int when the original column contains strings and NaNs.
Ultimately, the behavior is due to how Numpy implements casting from datetime to object. Now, why does Numpy do it that way? I don't know. That is a good new question and a whole other rabbit hole.