Pandas fillna with a lookup table

Tags: python, pandas

Having some trouble with filling NaNs. I want to take a dataframe column with a few NaNs and fill them with a value derived from a 'lookup table' based on a value from another column. (You might recognize my data from the Titanic data set)...

    Pclass   Age
0   1        33
1   3        24
2   1        23
3   2        NaN
4   1        Nan

I want to fill the NaN with a value from series 'pclass_lookup':

pclass_lookup
1        38.1
2        29.4
3        25.2

I have tried doing fillna with indexing like:

df.Age.fillna(pclass_lookup[df.Pclass])

but it gives me an error of:

    ValueError: cannot reindex from a duplicate axis

lambdas were a try too:

df.Age.map(lambda x: x if x else pclass_lookup[df.Pclass])

but that does not seem to fill it right, either. Am I totally missing the boat here?

asked Mar 27 '15 04:03

zampy

2 Answers

Firstly, you have a duff value in row 4: it is in fact the string 'Nan', which is not the same as NaN, so even if your code did work this value would never be replaced.
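To see why the string slips through, here is a quick check. It is only a sketch: the frame is rebuilt by hand from the question's sample, with row 4 assumed to hold the literal string 'Nan'.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt by hand from the question; row 4 holds the string 'Nan'.
df = pd.DataFrame({'Pclass': [1, 3, 1, 2, 1],
                   'Age': [33, 24, 23, np.nan, 'Nan']})

# isnull() flags only the real missing value in row 3, not the string in row 4.
mask = df['Age'].isnull()
print(mask.tolist())  # [False, False, False, True, False]

# Because of the string, the column is stored as object dtype rather than float.
print(df['Age'].dtype)
```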

So you need to replace that duff value and then you can just call map to perform the lookup on the NaN values:

In [317]:

df['Age'] = df['Age'].replace('Nan', np.nan)
df.loc[df['Age'].isnull(),'Age'] = df['Pclass'].map(df1.pclass_lookup)
df
Out[317]:
   Pclass   Age
0       1    33
1       3    24
2       1    23
3       2  29.4
4       1  38.1
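A self-contained variant of the same idea, as a minimal sketch. The frame and `pclass_lookup` are rebuilt by hand from the question, `pd.to_numeric(..., errors='coerce')` is swapped in for the `replace` call above (on the assumption that every non-numeric Age should become NaN), and `fillna` is given the mapped Series instead of assigning through `loc`.

```python
import numpy as np
import pandas as pd

# Data rebuilt by hand from the question, including the stray string 'Nan'.
df = pd.DataFrame({'Pclass': [1, 3, 1, 2, 1],
                   'Age': [33, 24, 23, np.nan, 'Nan']})
pclass_lookup = pd.Series([38.1, 29.4, 25.2], index=[1, 2, 3])

# Turn anything non-numeric (including the string 'Nan') into a real NaN.
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')

# map() looks up a value for every row via Pclass;
# fillna() then uses those values only where Age is missing.
df['Age'] = df['Age'].fillna(df['Pclass'].map(pclass_lookup))
print(df)
```

This works where the question's `fillna(pclass_lookup[df.Pclass])` did not because the mapped Series keeps the index of `df`, so `fillna` can align it row by row.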

Timings

For a df with 5000 rows:

In [26]:

%timeit df.loc[df['Age'].isnull(),'Age'] = df['Pclass'].map(df1.pclass_lookup)
100 loops, best of 3: 2.41 ms per loop
In [27]:

%%timeit
def remove_na(x):
    if pd.isnull(x['Age']):
        return df1[x['Pclass']]
    else:
        return x['Age']
df['Age'] =df.apply(remove_na, axis=1)
1 loops, best of 3: 278 ms per loop
In [28]:

%%timeit
nulls = df.loc[df.Age.isnull(), 'Pclass']
df.loc[df.Age.isnull(), 'Age'] = df1.loc[nulls].values
100 loops, best of 3: 3.37 ms per loop

So you can see that apply, because it iterates row-wise, scales poorly compared to the other two methods, which are vectorised, and that map is still the fastest.
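To reproduce a comparison like this outside IPython, something along these lines could be used. It is a sketch under assumptions: the 5000-row frame is random data (roughly 20% of ages blanked out), the function names are made up for illustration, and the numbers it prints will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical benchmark data: 5000 rows, roughly 20% of ages missing.
rng = np.random.default_rng(0)
n = 5000
base = pd.DataFrame({'Pclass': rng.integers(1, 4, size=n),
                     'Age': rng.uniform(1, 80, size=n)})
base.loc[rng.random(n) < 0.2, 'Age'] = np.nan
pclass_lookup = pd.Series([38.1, 29.4, 25.2], index=[1, 2, 3])

def fill_with_map(df):
    # Vectorised: map the whole Pclass column, assign only where Age is null.
    out = df.copy()
    out.loc[out['Age'].isnull(), 'Age'] = out['Pclass'].map(pclass_lookup)
    return out

def fill_with_apply(df):
    # Row-wise: one Python-level function call per row.
    out = df.copy()
    out['Age'] = out.apply(
        lambda row: pclass_lookup.loc[int(row['Pclass'])]
        if pd.isnull(row['Age']) else row['Age'],
        axis=1)
    return out

def fill_with_loc(df):
    # Vectorised: index the lookup with the Pclass values of the null rows.
    out = df.copy()
    mask = out['Age'].isnull()
    out.loc[mask, 'Age'] = pclass_lookup.loc[out.loc[mask, 'Pclass']].values
    return out

for func in (fill_with_map, fill_with_apply, fill_with_loc):
    seconds = timeit.timeit(lambda: func(base), number=5) / 5
    print(f'{func.__name__}: {seconds * 1000:.2f} ms per call')
```

Each function works on a copy, so every run starts from the same frame with missing values; the copy cost is included in all three timings.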

answered Sep 20 '22 17:09

EdChum

Building on the response of @vrajs5:

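import pandas as pd

# Create dummy data
df = pd.DataFrame()
df['Pclass'] = [1, 3, 1, 2, 1]
df['Age'] = [33, 24, 23, None, None]
pclass_lookup = pd.Series([38.1, 29.4, 25.2], index=range(1, 4))

# Solution:
# Pclass values of the rows whose Age is missing
nulls = df.loc[df.Age.isnull(), 'Pclass']
# Look up one replacement per missing row; .values drops the lookup's index
df.loc[df.Age.isnull(), 'Age'] = pclass_lookup.loc[nulls].values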

>>> df
   Pclass   Age
0       1  33.0
1       3  24.0
2       1  23.0
3       2  29.4
4       1  38.1

answered Sep 20 '22 17:09

Alexander
