Pandas: Type conversion using `df.loc` from datetime64 to int

When trying to re-assign certain values in a column using df.loc[], I am seeing a strange type conversion: my datetimes are silently converted to integers.

Minimal Example:

import numpy as np
import pandas as pd
import datetime
d = pd.DataFrame(list(zip(['12/6/2015', np.nan], [1, 2])), columns=list('ab'))
print(d)
d.loc[pd.notnull(d.a), 'a'] = d.a[pd.notnull(d.a)].apply(lambda x: datetime.datetime(2015,12,6))
print(d)

Full Example:

Here is my dataframe (contains NaNs):

>>> df.head()

  prior_ea_date quarter
0    12/31/2015      Q2
1    12/31/2015      Q3
2    12/31/2015      Q3
3    12/31/2015      Q3
4    12/31/2015      Q2

>>> df.prior_ea_date

0         12/31/2015
1         12/31/2015
...
341486     1/19/2016
341487      1/6/2016
Name: prior_ea_date, dtype: object

I want to run the following line of code:

df.loc[pd.notnull(df.prior_ea_date), 'prior_ea_date'] = df.prior_ea_date[pd.notnull(df.prior_ea_date)].apply(dt, usa=True)

where dt is a string-to-datetime parser; run on its own, it gives:

>>> df.prior_ea_date[pd.notnull(df.prior_ea_date)].apply(dt, usa=True).head()

0   2015-12-31
1   2015-12-31
2   2015-12-31
3   2015-12-31
4   2015-12-31
Name: prior_ea_date, dtype: datetime64[ns]

However, when I run the .loc[] I get the following:

>>> df.loc[pd.notnull(df.prior_ea_date), 'prior_ea_date'] = df.prior_ea_date[pd.notnull(df.prior_ea_date)].apply(dt, usa=True)
>>> df.head()

         prior_ea_date quarter
0  1451520000000000000      Q2
1  1451520000000000000      Q3
2  1451520000000000000      Q3
3  1451520000000000000      Q3
4  1451520000000000000      Q2

and it has converted my datetime objects to integers.

  • Why is this happening?
  • How do I avoid this behavior?

I have managed to build a temporary workaround, so while any one-line hacks would be appreciated, I would prefer a pandas-style solution.

Thanks.

asked Aug 19 '16 at 15:08 by oliversm



1 Answer

We'll start with your second question: how do I avoid this behavior?

My understanding is that you want to convert the prior_ea_date column to datetime objects. The pandas-style approach is to use to_datetime:

df.prior_ea_date = pd.to_datetime(df.prior_ea_date, format='%m/%d/%Y')
df.prior_ea_date

0   2015-12-31
1   2015-12-31
2   2015-12-31
3   2015-12-31
4   2015-12-31
5          NaT
Name: prior_ea_date, dtype: datetime64[ns]
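
Once the column is converted up front, a later .loc assignment matches the column's dtype and no longer triggers the cast. A minimal sketch of this, using the toy frame from the question:

```python
import datetime

import numpy as np
import pandas as pd

d = pd.DataFrame(list(zip(['12/6/2015', np.nan], [1, 2])), columns=list('ab'))

# Convert the whole column first; NaN becomes NaT automatically.
d.a = pd.to_datetime(d.a, format='%m/%d/%Y')

# A .loc assignment on a slice now matches the column dtype, so no cast occurs.
d.loc[d.a.notnull(), 'a'] = datetime.datetime(2015, 12, 6)
print(d.a.dtype)  # datetime64[ns]
```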

Your first question is more interesting: why is this happening?

What I think is happening is that when you use df.loc[pd.notnull(df.prior_ea_date), 'prior_ea_date'] = .... you are setting values on a slice of the prior_ea_date column instead of overwriting the whole column. In this case, pandas performs a tacit type cast to convert the right-hand side to the dtype of the original prior_ea_date column. Notice that those long integers are nanosecond epoch times for the intended dates.
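
A quick way to convince yourself those integers are nanosecond epoch values (a sanity check I'm adding here, not part of the original trace): pd.Timestamp exposes the underlying nanosecond count via its .value attribute.

```python
import pandas as pd

# 12/31/2015 as a nanosecond-precision timestamp
ts = pd.Timestamp('2015-12-31')

# .value is the integer count of nanoseconds since the Unix epoch,
# which is exactly what showed up in the corrupted column.
print(ts.value)  # 1451520000000000000
```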

We can see this with your minimal example:

##
# Example of type casting on slice
##

d = pd.DataFrame(list(zip(['12/6/2015', np.nan], [1, 2])), columns=list('ab'))

# Column-a starts as dtype: object
d.a
0    12/6/2015
1          NaN
Name: a, dtype: object

d.loc[pd.notnull(d.a), 'a'] = d.a[pd.notnull(d.a)].apply(lambda x: datetime.datetime(2015,12,6))

# Column-a is still dtype: object
d.a
0    1449360000000000000
1                    NaN
Name: a, dtype: object

##
# Example of overwriting whole column
##

d = pd.DataFrame(list(zip(['12/6/2015', np.nan], [1, 2])), columns=list('ab'))
d.a = pd.to_datetime(d.a, format='%m/%d/%Y')

# Column-a dtype is now datetime
d.a
0   2015-12-06
1          NaT
Name: a, dtype: datetime64[ns]

FURTHER DETAILS:

In response to the OP's request for more under-the-hood details, I traced the call stack in PyCharm to learn what is going on. The TL;DR answer: ultimately, the unexpected behavior of casting datetime values into integers is due to NumPy's internal casting behavior.

d = np.datetime64('2015-12-30T16:00:00.000000000-0800')
d.astype(np.dtype(object))
#>>> 1451520000000000000L

...could you elaborate on why this type casting is happening when using .loc and how to avoid it...

The intuition in my original answer is correct: the datetime objects are being cast into the generic object dtype, because setting values on a loc slice preserves the dtype of the column being set.
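
The same dtype-preserving behavior shows up with friendlier dtypes too. A small illustration (my own, not from the original trace): setting integers into a float column leaves the column float64 and casts the values instead.

```python
import pandas as pd

s = pd.DataFrame({'x': [0.5, 1.5, 2.5]})

# The right-hand side holds ints, but the column keeps its float64 dtype;
# the values are cast to match the column, not the other way around.
s.loc[[0, 1], 'x'] = [1, 2]

print(s.x.dtype)     # float64
print(s.x.tolist())  # [1.0, 2.0, 2.5]
```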

When setting values with loc, pandas uses the _LocationIndexer in the indexing module. After a great deal of checking dimensions and conditions, the line self.obj._data = self.obj._data.setitem(indexer, value) actually sets the new values.

Stepping into that line, we find the moment the datetimes are cast into integers, at line 742 of pandas/core/internals.py:

values[indexer] = value  

In this statement, values is a NumPy ndarray of object dtype; it is the data from the left-hand side of the original assignment and contains the date strings. The indexer is just a tuple, and value is an ndarray of NumPy datetime64 objects.

This operation uses NumPy's own setitem machinery, which fills individual "cells" with calls to np.asarray(value, self.dtype). In your case, self.dtype is the dtype of the left-hand side (object), and value is each individual datetime.

np.asarray(d, np.dtype(object))
#>>> array(1451520000000000000L, dtype=object)

...and how to avoid it...
Don't use loc. Overwrite the whole column as in my example above.
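
If the raw column might also contain strings the format cannot parse, to_datetime's errors='coerce' option turns them into NaT instead of raising, which keeps the whole-column approach viable. A sketch (the 'garbage' row is my own addition for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'prior_ea_date': ['12/31/2015', '1/6/2016', 'garbage', np.nan]})

# Unparseable strings and NaN both become NaT; the column comes out datetime64[ns].
df['prior_ea_date'] = pd.to_datetime(df['prior_ea_date'],
                                     format='%m/%d/%Y', errors='coerce')

print(df['prior_ea_date'].dtype)        # datetime64[ns]
print(df['prior_ea_date'].isna().sum()) # 2
```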

...I thought having the column with dtype=object would avoid pandas assuming the object type. And either way it seems unexpected to me why it should be converting it to an int when the original column contains strings and NaNs.

Ultimately, the behavior is due to how NumPy implements casting from datetime to object. Now why does NumPy do it that way? I don't know. That is a good new question and a whole other rabbit hole.

answered Sep 18 '22 at 12:09 by andrew