Having some trouble with filling NaNs. I want to take a dataframe column with a few NaNs and fill them with a value derived from a 'lookup table' based on a value from another column. (You might recognize my data from the Titanic data set)...
Pclass Age
0 1 33
1 3 24
2 1 23
3 2 NaN
4 1 Nan
I want to fill the NaN with a value from series 'pclass_lookup':
pclass_lookup
1 38.1
2 29.4
3 25.2
I have tried doing fillna with indexing like:
df.Age.fillna(pclass_lookup[df.Pclass]), but it gives me an error of
ValueError: cannot reindex from a duplicate axis
lambdas were a try too:
df.Age.map(lambda x: x if x else pclass_lookup[df.Pclass]
but, that seems not to fill it right, either. Am I totally missing the boat here? '
Definition and Usage The fillna() method replaces the NULL values with a specified value. The fillna() method returns a new DataFrame object unless the inplace parameter is set to True , in that case the fillna() method does the replacing in the original DataFrame instead.
pandas fillna NaN with None Value In order to update the existing DataFrame use df. fillna('None', inplace=True) . You can also use pandas. DataFrame.
In many cases, you will want to replace missing values in a Pandas DataFrame instead of dropping it completely. The fillna method is designed for this. Pandas has a built-in method called dropna. When applied against a DataFrame, the dropna method will remove any rows that contain a NaN value.
Firstly you have a duff value for row 4, you in fact have string 'Nan' which is not the same as 'NaN' so even if your code did work this value would never be replaced.
So you need to replace that duff value and then you can just call map to perform the lookup on the NaN
values:
In [317]:
df.Age.replace('Nan', np.NaN, inplace=True)
df.loc[df['Age'].isnull(),'Age'] = df['Pclass'].map(df1.pclass_lookup)
df
Out[317]:
Pclass Age
0 1 33
1 3 24
2 1 23
3 2 29.4
4 1 38.1
Timings
For a df with 5000 rows:
In [26]:
%timeit df.loc[df['Age'].isnull(),'Age'] = df['Pclass'].map(df1.pclass_lookup)
100 loops, best of 3: 2.41 ms per loop
In [27]:
%%timeit
def remove_na(x):
if pd.isnull(x['Age']):
return df1[x['Pclass']]
else:
return x['Age']
df['Age'] =df.apply(remove_na, axis=1)
1 loops, best of 3: 278 ms per loop
In [28]:
%%timeit
nulls = df.loc[df.Age.isnull(), 'Pclass']
df.loc[df.Age.isnull(), 'Age'] = df1.loc[nulls].values
100 loops, best of 3: 3.37 ms per loop
So you see here that apply as it is iterating row-wise scales poorly compared to the other two methods which are vectorised but map
is still the fastest.
Building on the response of @vrajs5:
# Create dummy data
df = pd.DataFrame()
df['Pclass'] = [1,3,1,2,1]
df['Age'] = [33,24,23,None, None]
pclass_lookup = pd.Series([38.1,29.4,25.2], index = range(1,4))
# Solution:
nulls = df.loc[df.Age.isnull(), 'Pclass']
df.loc[df.Age.isnull(), 'Age'] = pclass_lookup.loc[nulls].values
>>> df
Pclass Age
0 1 33.0
1 3 24.0
2 1 23.0
3 2 29.4
4 1 38.1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With