The notnull() method returns a DataFrame object where all the values are replaced with a Boolean value True for NOT NULL values, and otherwise False.
You can perform the same task using the dot operator. To select multiple columns, you can pass a list of column names to the indexing operator. Alternatively, you can assign all your columns to a list variable and pass that variable to the indexing operator.
You can pass a boolean mask to your df based on notnull()
of 'Survive' column and select the cols of interest:
In [2]:
# make some data
df = pd.DataFrame(np.random.randn(5,7), columns= ['Survive', 'Age','Fare', 'Group_Size','deck', 'Pclass', 'Title' ])
df['Survive'].iloc[2] = np.NaN
df
Out[2]:
Survive Age Fare Group_Size deck Pclass Title
0 1.174206 -0.056846 0.454437 0.496695 1.401509 -2.078731 -1.024832
1 0.036843 1.060134 0.770625 -0.114912 0.118991 -0.317909 0.061022
2 NaN -0.132394 -0.236904 -0.324087 0.570660 0.758084 -0.176421
3 -2.145934 -0.020003 -0.777785 0.835467 1.498284 -1.371325 0.661991
4 -0.197144 -0.089806 -0.706548 1.621260 1.754292 0.725897 0.860482
Now pass a mask to loc
to take only non NaN
rows:
In [3]:
xtrain = df.loc[df['Survive'].notnull(), ['Age','Fare', 'Group_Size','deck', 'Pclass', 'Title' ]]
xtrain
Out[3]:
Age Fare Group_Size deck Pclass Title
0 -0.056846 0.454437 0.496695 1.401509 -2.078731 -1.024832
1 1.060134 0.770625 -0.114912 0.118991 -0.317909 0.061022
3 -0.020003 -0.777785 0.835467 1.498284 -1.371325 0.661991
4 -0.089806 -0.706548 1.621260 1.754292 0.725897 0.860482
Two alternatives because... well why not?
Both drop nan
prior to column slicing. That's two call rather than EdChum's one call.
one
df.dropna(subset=['Survive'])[
['Age','Fare', 'Group_Size','deck', 'Pclass', 'Title' ]]
two
df.query('Survive == Survive')[
['Age','Fare', 'Group_Size','deck', 'Pclass', 'Title' ]]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With