
removing NaN values in python pandas

Tags:

python

pandas

csv

The data is adult income from census data. Rows look like:

31, Private, 84154, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 38, NaN, >50K
48, Self-emp-not-inc, 265477, Assoc-acdm, 12, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K

I'm trying to remove all rows with NaNs from a DataFrame loaded from a CSV file in pandas.

>>> import pandas as pd
>>> income = pd.read_csv('income.data')
>>> income['type'].unique()
array([ State-gov,  Self-emp-not-inc,  Private,  Federal-gov,  Local-gov,
    NaN,  Self-emp-inc,  Without-pay,  Never-worked], dtype=object)
>>> income.dropna(how='any') # should drop all rows with NaNs
>>> income['type'].unique()
array([ State-gov,  Self-emp-not-inc,  Private,  Federal-gov,  Local-gov,
    NaN,  Self-emp-inc,  Without-pay,  Never-worked], dtype=object) # what??
>>> income = income.dropna(how='any') # ok, maybe reassignment will work?
>>> income['type'].unique()
array([ State-gov,  Self-emp-not-inc,  Private,  Federal-gov,  Local-gov,
    NaN,  Self-emp-inc,  Without-pay,  Never-worked], dtype=object) # what??

I tried with a smaller example.csv:

label,age,sex
1,43,M
-1,NaN,F
1,65,NaN

And dropna() worked just fine here for both categorical and numerical NaNs. What is going on? I'm new to Pandas, just learning the ropes.

lollercoaster asked Oct 21 '25 16:10


2 Answers

As I wrote in the comment: the "NaN" values have a leading whitespace (at least in the data you provided), so pandas reads them as the ordinary string " NaN" rather than as missing values. Therefore, you need to specify the na_values parameter in the read_csv function.

Try this one:

df = pd.read_csv("income.csv", header=None, na_values=" NaN")

This is also why your second example works: there is no leading whitespace there.
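
To see the effect in isolation, here is a minimal sketch using a few made-up rows in the same ", "-separated style as the question. The column names and the in-memory data are assumptions for illustration, not the real income file. It compares default parsing with na_values=" NaN" and with skipinitialspace=True, another read_csv option that strips the space after each delimiter so that the default NaN detection applies.

```python
import io

import pandas as pd

# Hypothetical rows: fields are separated by ", ", so every value after the
# first one carries a leading space, including " NaN".
raw = (
    "31, Private, 84154, NaN, >50K\n"
    "48, NaN, 265477, United-States, <=50K\n"
    "52, Self-emp-inc, 99999, United-States, >50K\n"
)
cols = ["age", "type", "fnlwgt", "country", "income"]  # assumed names

# Default parsing: " NaN" is kept as a plain string, so dropna() removes nothing.
plain = pd.read_csv(io.StringIO(raw), header=None, names=cols)
n_plain = len(plain.dropna())

# Option 1: declare " NaN" (with the space) as a missing-value marker.
marked = pd.read_csv(io.StringIO(raw), header=None, names=cols, na_values=" NaN")
n_marked = len(marked.dropna())

# Option 2: strip the space after each delimiter; "NaN" is then recognized
# by the default missing-value markers.
stripped = pd.read_csv(
    io.StringIO(raw), header=None, names=cols, skipinitialspace=True
)
n_stripped = len(stripped.dropna())

print(n_plain, n_marked, n_stripped)
```

With option 2 the remaining string values also lose their leading space ("Private" instead of " Private"), which may or may not be what you want.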

dorvak answered Oct 23 '25 05:10


Drop all rows with NaN values

df2 = df.dropna()
df2 = df.dropna(axis=0)  # equivalent: axis=0 (rows) is the default

Reset the index after dropping

df2=df.dropna().reset_index(drop=True)

Drop rows in which all values are NaN

df2=df.dropna(how='all')

Drop rows that have NaN values in selected columns

df2=df.dropna(subset=['length','Height'])
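
A small sketch of the variants above on a made-up DataFrame (the data is hypothetical; the column names follow the subset example). Note that dropna() returns a new DataFrame and leaves df unchanged unless you reassign the result.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, rows 1 and 2 are partly missing,
# row 3 is missing everywhere.
df = pd.DataFrame({
    "length": [1.0, np.nan, 3.0, np.nan],
    "Height": [10.0, 20.0, np.nan, np.nan],
    "label": ["a", "b", "c", None],
})

any_nan = df.dropna()                    # drops every row containing a NaN
all_nan = df.dropna(how="all")           # drops only the fully empty row
subset = df.dropna(subset=["length"])    # looks at the 'length' column only
reindexed = subset.reset_index(drop=True)  # renumber the surviving rows

print(list(any_nan.index), list(all_nan.index), list(subset.index))
print(len(df))  # the original frame still has all of its rows
```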
Francis Odero answered Oct 23 '25 07:10


