I've encountered a strange problem. I'm sure there is a logical reason behind this.
I have a dataframe called alloptions that has 4 columns, minage1, minage2, minage3, and minage4, which are all float64. The number of missing values increases from minage1 to minage4.
I create a fifth column that takes the minimum of these four columns:
alloptions['minage']=alloptions.apply(lambda x: min([x['minage1'],x['minage2'],x['minage3'],x['minage4']]),axis=1)
This looked like it worked until I discovered that in row 47:
    minage1  minage2  minage3  minage4  minage
47      NaN     56.0      NaN      NaN     NaN
Using .loc, I isolate that row:
In [10]:
print(alloptions.loc[47, :])
print(alloptions.loc[47, :].dtypes)
I get
minage1 NaN
minage2 56
minage3 NaN
minage4 NaN
minage NaN
Name: 47, dtype: float64
float64
So I'm confused as to why the function didn't pick up 56.
Thank you in advance for your help.
You are using the builtin Python min function, which doesn't know about nan and treats it inconsistently, depending on where it appears in the arguments:
>>> min(1, np.nan)
1
>>> min(np.nan, 1)
nan
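The reason for the inconsistency, as far as I understand it, is that the builtin min keeps the first item as its running minimum and only replaces it when a later item compares as smaller, and every ordering comparison involving nan is False. A minimal sketch using only the standard library (the variable names here are just for illustration):

```python
import math

nan = float("nan")

# Every ordering comparison that involves nan evaluates to False.
print(nan < 1, 1 < nan)  # False False

first = min(1, nan)   # nan < 1 is False, so the running minimum stays 1
second = min(nan, 1)  # 1 < nan is False, so the running minimum stays nan
print(first, second)  # 1 nan

# Same value order as row 47 in the question: nan comes first,
# so 56.0 never replaces it.
row_min = min([nan, 56.0, nan, nan])
print(math.isnan(row_min))  # True
```

This is why row 47 ended up as NaN while rows whose minage1 was present looked fine.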
Instead, use the min method from pandas, which knows to ignore nan values when computing the min. This method takes an axis argument, so if your four minageX columns are the only columns in your DataFrame, you can just do:
df['minage'] = df.min(axis=1)
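If the DataFrame has other columns as well, one option is to select the four minageX columns first and take the row-wise min of just those. A small sketch, assuming made-up sample data (only the second row mirrors row 47 from the question; the "other" column and the age_cols name are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the second row has the same pattern as row 47.
alloptions = pd.DataFrame({
    "minage1": [30.0, np.nan, 41.0],
    "minage2": [35.0, 56.0, np.nan],
    "minage3": [np.nan, np.nan, np.nan],
    "minage4": [np.nan, np.nan, np.nan],
    "other": [1.0, 2.0, 3.0],  # unrelated column that must not affect the min
})

# Restrict the row-wise min to the four age columns; NaN is skipped by default.
age_cols = ["minage1", "minage2", "minage3", "minage4"]
alloptions["minage"] = alloptions[age_cols].min(axis=1)

print(alloptions["minage"].tolist())  # [30.0, 56.0, 41.0]
```

A row where all four columns are NaN would still come out as NaN, since there is nothing left to take the min of.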
In general when working with pandas data structures you should avoid using builtin Python functions like max, min, sum, and so on, and instead use the pandas versions; the builtin functions do not know anything about pandas or about vectorized operations, and may give unexpected results.
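The same difference shows up with sum and max. A quick sketch on a made-up Series (the values and variable names are only for illustration):

```python
import math
import numpy as np
import pandas as pd

ages = pd.Series([np.nan, 56.0, 40.0])

# Builtins iterate over the raw values and let nan propagate.
builtin_total = sum(ages)  # 0 + nan is nan, and it stays nan
builtin_high = max(ages)   # nan comes first and nothing compares greater

# The pandas methods skip missing values by default (skipna=True).
pandas_total = ages.sum()
pandas_high = ages.max()

print(math.isnan(builtin_total), math.isnan(builtin_high))  # True True
print(pandas_total, pandas_high)  # 96.0 56.0
```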