I'm trying to create a new column in a pandas DataFrame holding the maximum (or minimum) date from two other date columns. However, when there is a NaN anywhere in either of those columns, the whole min/max column becomes NaN. What gives? With numeric columns this works fine, but with dates the new column is all NaN. Here's some sample code to illustrate the problem:
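from datetime import date
import numpy as np
import pandas as pd
df = pd.DataFrame(data=[[np.nan, date(2000,11,1)],
                        [date(2000,12,1), date(2000,9,1)],
                        [date(2000,4,1), np.nan],
                        [date(2000,12,2), np.nan]], columns=['col1','col2'])
# row-wise maximum of the two date columns (this is the step that yields all NaN)
df['col3'] = df[['col1','col2']].max(axis=1)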
I know it can be done with loc and a combination of <, >, isnull and so on. But how can I make it work with the regular max/min functions?
You're storing date objects in your columns. If you convert them to datetime, then it works as expected:
In[10]:
df['col1'] = pd.to_datetime(df['col1'])
df['col2'] = pd.to_datetime(df['col2'])
df
Out[10]:
col1 col2 col3
0 NaT 2000-11-01 NaN
1 2000-12-01 2000-09-01 NaN
2 2000-04-01 NaT NaN
3 2000-12-02 NaT NaN
In[11]:
df['col3'] = df[['col1','col2']].max(axis=1)
df
Out[11]:
col1 col2 col3
0 NaT 2000-11-01 2000-11-01
1 2000-12-01 2000-09-01 2000-12-01
2 2000-04-01 NaT 2000-04-01
3 2000-12-02 NaT 2000-12-02
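To see what the conversion above changes, it can help to inspect the dtype before and after. This is a minimal sketch, assuming a small throwaway Series named `s` that is not part of the original example. It uses the helper functions in `pd.api.types` rather than comparing against an exact dtype string, since the datetime resolution may differ between pandas versions.

```python
from datetime import date

import numpy as np
import pandas as pd

# Hypothetical Series mixing a missing value with datetime.date objects
s = pd.Series([np.nan, date(2000, 12, 1), date(2000, 4, 1)])

# Before conversion the values are generic Python objects
print(pd.api.types.is_object_dtype(s))

# After conversion the Series has a datetime64 dtype and NaN becomes NaT
converted = pd.to_datetime(s)
print(pd.api.types.is_datetime64_any_dtype(converted))
print(converted.isna().tolist())
```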
If you simply did:
df['col3'] = df['col1'].max()
this raises a TypeError: '>=' not supported between instances of 'float' and 'datetime.date'
The NaN values cause the dtype to be promoted to float, so NaN gets returned. If you had no missing values, it would work as expected. If you do have missing values, you should convert the dtype to datetime so that the missing values are converted to NaT and max works correctly.
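Putting the pieces together, here is a self-contained sketch of the approach described above, covering both max and min. The column names `col_max` and `col_min` and the list `date_cols` are assumptions introduced for illustration; the conversion is done one column at a time, as in the session above.

```python
from datetime import date

import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=[[np.nan, date(2000, 11, 1)],
          [date(2000, 12, 1), date(2000, 9, 1)],
          [date(2000, 4, 1), np.nan],
          [date(2000, 12, 2), np.nan]],
    columns=['col1', 'col2'])

date_cols = ['col1', 'col2']  # hypothetical helper list of the date columns

# Convert each column to a datetime dtype; NaN becomes NaT
for col in date_cols:
    df[col] = pd.to_datetime(df[col])

# Row-wise max and min; NaT is skipped by default (skipna=True)
df['col_max'] = df[date_cols].max(axis=1)
df['col_min'] = df[date_cols].min(axis=1)

print(df)
```

With the datetime dtype in place, rows that have only one valid date simply return that date for both the maximum and the minimum.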