Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas fillna with a lookup table

Tags:

python

pandas

Having some trouble with filling NaNs. I want to take a dataframe column with a few NaNs and fill them with a value derived from a 'lookup table' based on a value from another column. (You might recognize my data from the Titanic data set)...

    Pclass   Age
0   1        33
1   3        24
2   1        23
3   2        NaN
4   1        Nan

I want to fill the NaN with a value from series 'pclass_lookup':

pclass_lookup
1        38.1
2        29.4
3        25.2

I have tried doing fillna with indexing like:

df.Age.fillna(pclass_lookup[df.Pclass]), but it gives me an error of 
    ValueError: cannot reindex from a duplicate axis

lambdas were a try too:

df.Age.map(lambda x: x if x else pclass_lookup[df.Pclass]

but, that seems not to fill it right, either. Am I totally missing the boat here? '

like image 805
zampy Avatar asked Mar 27 '15 04:03

zampy


People also ask

How do I use Fillna in pandas?

Definition and Usage The fillna() method replaces the NULL values with a specified value. The fillna() method returns a new DataFrame object unless the inplace parameter is set to True , in that case the fillna() method does the replacing in the original DataFrame instead.

How do you get Fillna with none pandas?

pandas fillna NaN with None Value In order to update the existing DataFrame use df. fillna('None', inplace=True) . You can also use pandas. DataFrame.

What is the difference between Dropna and Fillna?

In many cases, you will want to replace missing values in a Pandas DataFrame instead of dropping it completely. The fillna method is designed for this. Pandas has a built-in method called dropna. When applied against a DataFrame, the dropna method will remove any rows that contain a NaN value.


2 Answers

Firstly you have a duff value for row 4, you in fact have string 'Nan' which is not the same as 'NaN' so even if your code did work this value would never be replaced.

So you need to replace that duff value and then you can just call map to perform the lookup on the NaN values:

In [317]:

df.Age.replace('Nan', np.NaN, inplace=True)
df.loc[df['Age'].isnull(),'Age'] = df['Pclass'].map(df1.pclass_lookup)
df
Out[317]:
   Pclass   Age
0       1    33
1       3    24
2       1    23
3       2  29.4
4       1  38.1

Timings

For a df with 5000 rows:

In [26]:

%timeit df.loc[df['Age'].isnull(),'Age'] = df['Pclass'].map(df1.pclass_lookup)
100 loops, best of 3: 2.41 ms per loop
In [27]:

%%timeit
def remove_na(x):
    if pd.isnull(x['Age']):
        return df1[x['Pclass']]
    else:
        return x['Age']
df['Age'] =df.apply(remove_na, axis=1)
1 loops, best of 3: 278 ms per loop
In [28]:

%%timeit
nulls = df.loc[df.Age.isnull(), 'Pclass']
df.loc[df.Age.isnull(), 'Age'] = df1.loc[nulls].values
100 loops, best of 3: 3.37 ms per loop

So you see here that apply as it is iterating row-wise scales poorly compared to the other two methods which are vectorised but map is still the fastest.

like image 123
EdChum Avatar answered Sep 20 '22 17:09

EdChum


Building on the response of @vrajs5:

# Create dummy data
df = pd.DataFrame()
df['Pclass'] = [1,3,1,2,1]
df['Age'] = [33,24,23,None, None]
pclass_lookup = pd.Series([38.1,29.4,25.2], index = range(1,4))

# Solution:
nulls = df.loc[df.Age.isnull(), 'Pclass']
df.loc[df.Age.isnull(), 'Age'] = pclass_lookup.loc[nulls].values

>>> df
   Pclass   Age
0       1  33.0
1       3  24.0
2       1  23.0
3       2  29.4
4       1  38.1
like image 32
Alexander Avatar answered Sep 20 '22 17:09

Alexander