
Pandas, numpy.where(), and numpy.nan

I want to use numpy.where() to add a column to a pandas.DataFrame. I'd like to use NaN values for the rows where the condition is false (to indicate that these values are "missing").

Consider:

>>> import numpy; import pandas
>>> df = pandas.DataFrame({'A':[1,2,3,4]}); print(df)
   A
0  1
1  2
2  3
3  4
>>> df['B'] = numpy.nan
>>> df['C'] = numpy.where(df['A'] < 3, 'yes', numpy.nan)
>>> print(df)
   A   B    C
0  1 NaN  yes
1  2 NaN  yes
2  3 NaN  nan
3  4 NaN  nan
>>> df.isna()
       A     B      C
0  False  True  False
1  False  True  False
2  False  True  False
3  False  True  False

Why does B show "NaN" but C shows "nan"? And why does DataFrame.isna() fail to detect the NaN values in C?

Should I use something other than numpy.nan inside where? None and pandas.NA both seem to work and can be detected by DataFrame.isna(), but I'm not sure these are the best choice.

Thank you!

Edit: As @Tim Roberts and @DYZ pointed out, numpy.where returns an array of strings here, so numpy.nan is converted as if by calling str() on it. The values in column C are actually the string "nan". The question remains, however: what is the most elegant thing to do here? Should I use None, or something else?

Duncan MacIntyre asked May 10 '21 21:05



1 Answer

numpy.where returns a single array, so it coerces its second and third arguments to a common datatype. Since the second argument is a string, the third one is converted to a string too, as if by calling str():

str(numpy.nan)
# 'nan'

As a result, the values in column C are all strings, which is why DataFrame.isna() does not flag them.
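
The coercion can be checked directly by inspecting the array that numpy.where returns. The following is a minimal sketch that reuses the DataFrame from the question; the variable names result and missing are only illustrative.

```python
import numpy
import pandas

df = pandas.DataFrame({'A': [1, 2, 3, 4]})
result = numpy.where(df['A'] < 3, 'yes', numpy.nan)

# dtype kind 'U' means a fixed-width unicode string array
print(result.dtype.kind)

# The "missing" entries are the literal string 'nan', not a float NaN
print(result[2] == 'nan')

# Wrapped in an object-dtype Series, none of the values count as missing
missing = pandas.Series(result, dtype=object).isna()
print(missing.any())
```

Because every element is an ordinary string, pandas has no way to tell that 'nan' was meant to stand for a missing value.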

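You can instead pass None as the third argument and then convert those values to numpy.nan with fillna():

# None forces an object-dtype result, so nothing is converted to a string
df['C'] = numpy.where(df['A'] < 3, 'yes', None)
# Replace None with numpy.nan and assign the result back to the column
df['C'] = df['C'].fillna(numpy.nan)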
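
A possible alternative, which is not part of the original answer, is to skip numpy.where and use Series.where, which fills rows that fail the condition with a missing value by default. This is a sketch under the assumption that every matching row should hold the same constant, 'yes'.

```python
import numpy
import pandas

df = pandas.DataFrame({'A': [1, 2, 3, 4]})

# Start from a column of 'yes' and keep it only where the condition holds;
# Series.where replaces the remaining rows with a missing value by default.
df['C'] = pandas.Series('yes', index=df.index).where(df['A'] < 3)

print(df['C'].isna().tolist())  # [False, False, True, True]
```

Whether this reads better than numpy.where followed by fillna() is a matter of taste; both leave column C with values that DataFrame.isna() detects.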
DYZ answered Oct 22 '22 15:10