I am trying to take the row-wise max (and min) of two columns containing dates:
from datetime import date
import pandas as pd
import numpy as np
df = pd.DataFrame({'date_a': [date(2015, 1, 1), date(2012, 6, 1),
                              date(2013, 1, 1), date(2016, 6, 1)],
                   'date_b': [date(2012, 7, 1), date(2013, 1, 1),
                              date(2014, 3, 1), date(2013, 4, 1)]})
df[['date_a', 'date_b']].max(axis=1)
Out[46]:
0    2015-01-01
1    2013-01-01
2    2014-03-01
3    2016-06-01
dtype: object
as expected. However, if the dataframe contains a single NaN value, the whole operation fails:
df_nan = pd.DataFrame({'date_a': [date(2015, 1, 1), date(2012, 6, 1),
                                  np.nan, date(2016, 6, 1)],
                       'date_b': [date(2012, 7, 1), date(2013, 1, 1),
                                  date(2014, 3, 1), date(2013, 4, 1)]})
df_nan[['date_a', 'date_b']].max(axis=1)
Out[49]:
0   NaN
1   NaN
2   NaN
3   NaN
dtype: float64
What is going on here? I was expecting this result:
0    2015-01-01
1    2013-01-01
2           NaN
3    2016-06-01
How can this be achieved?
I would say the best solution is to use the appropriate dtype. Pandas provides a well-integrated datetime dtype. Note that you are currently using object dtypes:
>>> df
       date_a      date_b
0  2015-01-01  2012-07-01
1  2012-06-01  2013-01-01
2         NaN  2014-03-01
3  2016-06-01  2013-04-01
>>> df.dtypes
date_a    object
date_b    object
dtype: object
But note, the problem disappears when you convert to the datetime dtype:
>>> df2 = df.apply(pd.to_datetime)
>>> df2
       date_a      date_b
0  2015-01-01  2012-07-01
1  2012-06-01  2013-01-01
2         NaT  2014-03-01
3  2016-06-01  2013-04-01
>>> df2.min(axis=1)
0   2012-07-01
1   2012-06-01
2   2014-03-01
3   2013-04-01
dtype: datetime64[ns]
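The max works the same way with this dtype (a quick sketch using the df2 just built; note that NaT is skipped by default):
>>> df2.max(axis=1)
0   2015-01-01
1   2013-01-01
2   2014-03-01
3   2016-06-01
dtype: datetime64[ns]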
This appears to happen when date objects are mixed with floats (such as NaN) in columns. By default, the numeric_only flag is set because of the single float value. For example, replace your df_nan with this:
df_float = pd.DataFrame({'date_a': [date(2015, 1, 1), date(2012, 6, 1),
                                    1.023, date(2016, 6, 1)],
                         'date_b': [date(2012, 7, 1), 3.14,
                                    date(2014, 3, 1), date(2013, 4, 1)]})
print(df_float.max(axis=1))
0   NaN
1   NaN
2   NaN
3   NaN
dtype: float64
If the flag is manually set to False, this would rightly throw a TypeError, because:
print(date(2015, 1, 1) < 1.0)

TypeError                                 Traceback (most recent call last)
<ipython-input-362-ccbf44ddb40a> in <module>()
      1
----> 2 print(date(2015, 1, 1) < 1.0)

TypeError: unorderable types: datetime.date() < float()
However, pandas seems to coerce everything to NaN. As a workaround, converting to str using df.astype appears to do it:
out = df_nan.astype(str).max(axis=1)
print(out)
0    2015-01-01
1    2013-01-01
2           nan
3    2016-06-01
dtype: object
In this case, comparing lexicographically yields the same result as before.
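Note that this relies on two details: zero-padded ISO-8601 strings (yyyy-mm-dd) compare in the same order as the dates they represent, and 'nan' sorts after any digit, so it wins a row-wise max (it would lose a row-wise min). A quick check:
>>> '2015-01-01' > '2012-07-01'   # ISO date strings order like the dates they encode
True
>>> 'nan' > '2016-06-01'          # 'n' sorts after every digit, so nan dominates a max
True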
Otherwise, as juan suggests, you can cast to datetime using pd.to_datetime:
out = df_nan.apply(pd.to_datetime, errors='coerce').max(axis=1)
print(out)
0   2015-01-01
1   2013-01-01
2   2014-03-01
3   2016-06-01
dtype: datetime64[ns]
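Note that max skips NaT by default, which is why row 2 shows 2014-03-01 rather than the missing value from your expected output. If you would rather propagate the missing value, skipna=False should do it; a sketch with the same df_nan:
out = df_nan.apply(pd.to_datetime, errors='coerce').max(axis=1, skipna=False)
print(out)
0   2015-01-01
1   2013-01-01
2          NaT
3   2016-06-01
dtype: datetime64[ns]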
The following should work:
>>> df_nan.where(df_nan.T.notnull().all()).max(axis=1)
Out[1]:
0    2015-01-01
1    2013-01-01
2          None
3    2016-06-01
dtype: object
Where:
- df_nan.T.notnull().all() computes a mask of rows containing no np.nan
- df_nan.where() applies that mask to the dataframe
- .max(axis=1) gets the row-wise maximum
This works because the maximum of a row where all values are np.nan is None. It allows you to keep track of rows where a value is missing by not showing a maximum.
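An equivalent sketch without the transpose (assuming the same df_nan): keep only the complete rows, take their row-wise max, then reindex so the masked rows come back as missing:
mask = df_nan.notnull().all(axis=1)   # True for rows with no missing value
out = df_nan[mask].max(axis=1)        # row-wise max over the complete rows only
out = out.reindex(df_nan.index)       # reinsert the masked rows as NaN
print(out)
0    2015-01-01
1    2013-01-01
2           NaN
3    2016-06-01
dtype: object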
But this decision is up to you; otherwise, the solution of @juanpa.arrivillaga, which converts NaN to NaT, is what you want.