Pandas: Type conversion using `df.loc` from datetime64 to int

When trying to re-assign certain values in a column using df.loc[], I am seeing a strange type conversion: my datetimes are silently converted to integers.

Minimal Example:

import numpy as np
import pandas as pd
import datetime
d = pd.DataFrame(list(zip(['12/6/2015', np.nan], [1, 2])), columns=list('ab'))
print(d)
d.loc[pd.notnull(d.a), 'a'] = d.a[pd.notnull(d.a)].apply(lambda x: datetime.datetime(2015,12,6))
print(d)

Full Example:

Here is my dataframe (contains NaNs):

>>> df.head()

  prior_ea_date quarter
0    12/31/2015      Q2
1    12/31/2015      Q3
2    12/31/2015      Q3
3    12/31/2015      Q3
4    12/31/2015      Q2

>>> df.prior_ea_date

0         12/31/2015
1         12/31/2015
...
341486     1/19/2016
341487      1/6/2016
Name: prior_ea_date, dtype: object

I want to run the following line of code:

df.loc[pd.notnull(df.prior_ea_date), 'prior_ea_date'] = df.prior_ea_date[pd.notnull(df.prior_ea_date)].apply(dt, usa=True)

where dt is a string-to-datetime parser; run on its own, it gives:

>>> df.prior_ea_date[pd.notnull(df.prior_ea_date)].apply(dt, usa=True).head()

0   2015-12-31
1   2015-12-31
2   2015-12-31
3   2015-12-31
4   2015-12-31
Name: prior_ea_date, dtype: datetime64[ns]

However, when I run the .loc[] I get the following:

>>> df.loc[pd.notnull(df.prior_ea_date), 'prior_ea_date'] = df.prior_ea_date[pd.notnull(df.prior_ea_date)].apply(dt, usa=True)
>>> df.head()

         prior_ea_date quarter
0  1451520000000000000      Q2
1  1451520000000000000      Q3
2  1451520000000000000      Q3
3  1451520000000000000      Q3
4  1451520000000000000      Q2

and it has converted my datetime objects to integers.

  • Why is this happening?
  • How do I avoid this behavior?

I have managed to build a temporary workaround, so while any one-line hacks would be appreciated, I would prefer a pandas-style solution.

Thanks.

asked Aug 19 '16 at 15:08 by oliversm



1 Answer

We'll start with your second question: how do I avoid this behavior?

My understanding is that you want to convert the prior_ea_date column to datetime objects. The pandas-style approach is to use to_datetime:

df.prior_ea_date = pd.to_datetime(df.prior_ea_date, format='%m/%d/%Y')
df.prior_ea_date

0   2015-12-31
1   2015-12-31
2   2015-12-31
3   2015-12-31
4   2015-12-31
5          NaT
Name: prior_ea_date, dtype: datetime64[ns]
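
Once the column is converted up front, a later .loc assignment matches the column's dtype and no longer triggers the cast. A minimal sketch of this, using the toy frame from the question:

```python
import datetime

import numpy as np
import pandas as pd

d = pd.DataFrame(list(zip(['12/6/2015', np.nan], [1, 2])), columns=list('ab'))

# Convert the whole column first; NaN becomes NaT automatically.
d.a = pd.to_datetime(d.a, format='%m/%d/%Y')

# A .loc assignment on a slice now matches the column dtype, so no cast occurs.
d.loc[d.a.notnull(), 'a'] = datetime.datetime(2015, 12, 6)
print(d.a.dtype)  # datetime64[ns]
```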

Your first question is more interesting: why is this happening?

What I think is happening is that when you use df.loc[pd.notnull(df.prior_ea_date), 'prior_ea_date'] = .... you are setting values on a slice of the prior_ea_date column instead of overwriting the whole column. In this case, pandas performs a tacit type cast to convert the right-hand side to the dtype of the original prior_ea_date column. Notice that those long integers are nanosecond epoch times for the intended dates.
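
A quick way to convince yourself those integers are nanosecond epoch values (a sanity check I'm adding here, not part of the original trace): pd.Timestamp exposes the underlying nanosecond count via its .value attribute.

```python
import pandas as pd

# 12/31/2015 as a nanosecond-precision timestamp
ts = pd.Timestamp('2015-12-31')

# .value is the integer count of nanoseconds since the Unix epoch,
# which is exactly what showed up in the corrupted column.
print(ts.value)  # 1451520000000000000
```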

We can see this with your minimal example:

##
# Example of type casting on slice
##

d = pd.DataFrame(list(zip(['12/6/2015', np.nan], [1, 2])), columns=list('ab'))

# Column-a starts as dtype: object
d.a
0    12/6/2015
1          NaN
Name: a, dtype: object

d.loc[pd.notnull(d.a), 'a'] = d.a[pd.notnull(d.a)].apply(lambda x: datetime.datetime(2015,12,6))

# Column-a is still dtype: object
d.a
0    1449360000000000000
1                    NaN
Name: a, dtype: object

##
# Example of overwriting whole column
##

d = pd.DataFrame(list(zip(['12/6/2015', np.nan], [1, 2])), columns=list('ab'))
d.a = pd.to_datetime(d.a, format='%m/%d/%Y')

# Column-a dtype is now datetime
d.a
0   2015-12-06
1          NaT
Name: a, dtype: datetime64[ns]

FURTHER DETAILS:

In response to the OP's request for more under-the-hood details, I traced the call stack in PyCharm to learn what is going on. The TL;DR answer: ultimately, the unexpected behavior of casting datetime values into integers is due to NumPy's internal casting behavior.

d = np.datetime64('2015-12-30T16:00:00.000000000-0800')
d.astype(np.dtype(object))
#>>> 1451520000000000000L

...could you elaborate on why this type casting is happening when using .loc and how to avoid it...

The intuition in my original answer is correct: the datetime objects are being cast into the generic object dtype, because setting values on a loc slice preserves the dtype of the column being set.
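
The same dtype-preserving behavior shows up with friendlier dtypes too. A small illustration (my own, not from the original trace): setting integers into a float column leaves the column float64 and casts the values instead.

```python
import pandas as pd

s = pd.DataFrame({'x': [0.5, 1.5, 2.5]})

# The right-hand side holds ints, but the column keeps its float64 dtype;
# the values are cast to match the column, not the other way around.
s.loc[[0, 1], 'x'] = [1, 2]

print(s.x.dtype)     # float64
print(s.x.tolist())  # [1.0, 2.0, 2.5]
```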

When setting values with loc, pandas uses the _LocationIndexer in the indexing module. After a great deal of checking dimensions and conditions, the line self.obj._data = self.obj._data.setitem(indexer, value) actually sets the new values.

Stepping into that line, we find the moment the datetimes are cast into integers, at line 742 of pandas/core/internals.py:

values[indexer] = value  

In this statement, values is a NumPy ndarray of object dtype; it is the data from the left-hand side of the original assignment and contains the date strings. The indexer is just a tuple, and value is an ndarray of NumPy datetime64 objects.

This operation uses NumPy's own setitem machinery, which fills individual "cells" with calls to np.asarray(value, self.dtype). In your case, self.dtype is the dtype of the left-hand side (object), and value is each individual datetime.

np.asarray(d, np.dtype(object))
#>>> array(1451520000000000000L, dtype=object)

...and how to avoid it...
Don't use loc. Overwrite the whole column as in my example above.
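
If the raw column might also contain strings the format cannot parse, to_datetime's errors='coerce' option turns them into NaT instead of raising, which keeps the whole-column approach viable. A sketch (the 'garbage' row is my own addition for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'prior_ea_date': ['12/31/2015', '1/6/2016', 'garbage', np.nan]})

# Unparseable strings and NaN both become NaT; the column comes out datetime64[ns].
df['prior_ea_date'] = pd.to_datetime(df['prior_ea_date'],
                                     format='%m/%d/%Y', errors='coerce')

print(df['prior_ea_date'].dtype)        # datetime64[ns]
print(df['prior_ea_date'].isna().sum()) # 2
```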

...I thought having the column with dtype=object would avoid pandas assuming the object type. And either way it seems unexpected to me why it should be converting it to an int when the original column contains strings and NaNs.

Ultimately, the behavior is due to how NumPy implements casting from datetime to object. Now why does NumPy do it that way? I don't know. That is a good new question and a whole other rabbit hole.

answered Sep 18 '22 at 12:09 by andrew