This is really weird. I have tried several ways of dropping rows with missing data from a pandas dataframe, but none of them seem to work. This is the code (I just uncomment one of the methods used - but these are the three that I used in different modifications - this is the latest):
import pandas as pd
Test = pd.DataFrame({'A':[1,2,3,4,5],'B':[1,2,'NaN',4,5],'C':[1,2,3,'NaT',5]})
print(Test)
#Test = Test.ix[Test.C.notnull()]
#Test = Test.dropna()
Test = Test[~Test[Test.columns.values].isnull()]
print "And now"
print(Test)
But in all cases, all I get is this:
A B C
0 1 1 1
1 2 2 2
2 3 NaN 3
3 4 4 NaT
4 5 5 5
And now
A B C
0 1 1 1
1 2 2 2
2 3 NaN 3
3 4 4 NaT
4 5 5 5
Is there any mistake that I am making? or what is the problem? Ideally, I would like to get this:
A B C
0 1 1 1
1 2 2 2
4 5 5 5
The dropna() method removes the rows that contains NULL values. The dropna() method returns a new DataFrame object unless the inplace parameter is set to True , in that case the dropna() method does the removing in the original DataFrame instead.
dropna() method is your friend. When you call dropna() over the whole DataFrame without specifying any arguments (i.e. using the default behaviour) then the method will drop all rows with at least one missing value.
The dropna() function is used to remove missing values. Determine if rows or columns which contain missing values are removed. 0, or 'index' : Drop rows which contain missing values.
Try this on orig data:
Test.replace(["NaN", 'NaT'], np.nan, inplace = True)
Test = Test.dropna()
Test
Or Modify data and do this
import pandas as pd
import numpy as np
Test = pd.DataFrame({'A':[1,2,3,4,5],'B':[1,2,np.nan,4,5],'C':[1,2,3,pd.NaT,5]})
print(Test)
Test = Test.dropna()
print(Test)
A B C
0 1 1.0 1
1 2 2.0 2
4 5 5.0 5
Your example DF has NaN
and NaT
as strings which .dropna
, .notnull
and co. won't consider falsey, so given your example you can use...
df[~df.isin(['NaN', 'NaT']).any(axis=1)]
Which gives you:
A B C
0 1 1 1
1 2 2 2
4 5 5 5
If you had a DF such as (note of the use of np.nan
and np.datetime64('NaT')
instead of strings:
df = pd.DataFrame({'A':[1,2,3,4,5],'B':[1,2,np.nan,4,5],'C':[1,2,3,np.datetime64('NaT'),5]})
Then running df.dropna()
which give you:
A B C
0 1 1.0 1
1 2 2.0 2
4 5 5.0 5
Note that column B
is now a float
instead of an integer as that's required to store NaN
values.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With