Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I filter dataframe based on null/not null using a column name as a variable?

Tags:

python

pandas

I want to list a dataframe where a specific column is either null or not null, I have it working using -

df[df.Survive.notnull()] # contains no missing values 
df[df.Survive.isnull()]  #---> contains missing values 

This works perfectly but I want to make my code more dynamic and pass the column "Survive" as a variable but it's not working for me.

I tried:

variableToPredict = ['Survive'] 
df[df[variableToPredict].notnull()]

I get the error - ValueError: cannot reindex from a duplicate axis

I'm sure I'm making a silly mistake, what can I do to fix this?

like image 300
Lostsoul Avatar asked Dec 13 '25 19:12

Lostsoul


1 Answers

So idea is always is necessary Series or list or 1d array for mask for filtering.

If want test only one column use scalar:

variableToPredict = 'Survive'
df[df[variableToPredict].notnull()]

But if add [] output is one column DataFrame, so is necessaty change function for test by any (test if at least one NaN per row, sense in multiple columns) or all (test if all NaNs per row, sense in multiple columns) functions:

variableToPredict = ['Survive']
df[df[variableToPredict].notnull().any(axis=1)]

variableToPredict = ['Survive', 'another column']
df[df[variableToPredict].notnull().any(axis=1)]

Sample:

df = pd.DataFrame({'Survive':[np.nan, 'A', 'B', 'B', np.nan],
                   'another column':[np.nan, np.nan, 'a','b','b']})

print (df)
  Survive another column
0     NaN            NaN
1       A            NaN
2       B              a
3       B              b
4     NaN              b

First if test only one column:

variableToPredict = 'Survive'
df1 = df[df[variableToPredict].notnull()]
print (df1)
  Survive another column
1       A            NaN
2       B              a
3       B              b

print (type(df[variableToPredict]))
<class 'pandas.core.series.Series'>

#Series
print (df[variableToPredict])
0    NaN
1      A
2      B
3      B
4    NaN
Name: Survive, dtype: object

print (df[variableToPredict].isnull())
0     True
1    False
2    False
3    False
4     True
Name: Survive, dtype: bool

If use list - here one element list:

variableToPredict = ['Survive']
print (type(df[variableToPredict]))
<class 'pandas.core.frame.DataFrame'>

#one element DataFrame
print (df[variableToPredict])
  Survive
0     NaN
1       A
2       B
3       B
4     NaN

If testing per rows it is same output for any or all:

print (df[variableToPredict].notnull().any(axis=1))
0    False
1     True
2     True
3     True
4    False
dtype: bool

print (df[variableToPredict].notnull().all(axis=1))
0    False
1     True
2     True
3     True
4    False
dtype: bool

If test one or more columns in list:

variableToPredict = ['Survive', 'another column']

print (type(df[variableToPredict]))
<class 'pandas.core.frame.DataFrame'>

print (df[variableToPredict])
  Survive another column
0     NaN            NaN
1       A            NaN
2       B              a
3       B              b
4     NaN              b

print (df[variableToPredict].notnull())
   Survive  another column
0    False           False
1     True           False
2     True            True
3     True            True
4    False            True


#at least one NaN per row, at least one True
print (df[variableToPredict].notnull().any(axis=1))
0    False
1     True
2     True
3     True
4     True
dtype: bool

#all NaNs per row, all Trues 
print (df[variableToPredict].notnull().all(axis=1))
0    False
1    False
2     True
3     True
4    False
dtype: bool
like image 78
jezrael Avatar answered Dec 15 '25 10:12

jezrael