How should I convert NaN values into a categorical value based on a condition? I am getting an error when trying to convert the NaN values.
category           gender     sub-category    title
health&beauty      NaN         makeup         lipbalm
health&beauty      women       makeup         lipstick
NaN                NaN         NaN            lipgloss
My DataFrame looks like this. My function to convert NaN values in gender to a categorical value looks like this:
def impute_gender(cols):
    category=cols[0]
    sub_category=cols[2]
    gender=cols[1]
    title=cols[3]
    if title.str.contains('Lip') and gender.isnull==True:
        return 'women'
df[['category','gender','sub_category','title']].apply(impute_gender,axis=1)
When I run the code I get this error:
----> 7     if title.str.contains('Lip') and gender.isnull()==True:
      8         print(gender)
      9 
AttributeError: ("'str' object has no attribute 'str'", 'occurred at index category')
Complete dataset: https://github.com/lakshmipriya04/py-sample
Or simply use loc as an option 3 to @COLDSPEED's answer
cond = (df['gender'].isnull()) & (df['title'].str.contains('lip'))
df.loc[cond, 'gender'] = 'women'
    category        gender  sub-category    title
0   health&beauty   women   makeup          lipbalm
1   health&beauty   women   makeup          lipstick
2   NaN             women   NaN             lipgloss
If we are dealing with NaN values, fillna can be one of the methods :-)
df.gender=df.gender.fillna(df.title.str.contains('lip').replace(True,'women'))
df
Out[63]: 
        category gender sub-category     title
0  health&beauty  women       makeup   lipbalm
1  health&beauty  women       makeup  lipstick
2            NaN  women          NaN  lipgloss
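One caveat worth checking with the fillna approach: a row with a missing gender whose title does not contain 'lip' would be filled with False rather than left as NaN. A minimal sketch, assuming a small made-up frame (the 'shampoo' row is hypothetical and not in the original data), that maps only True to 'women' so non-matching rows stay missing:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the 'shampoo' row is added for illustration only.
df = pd.DataFrame({
    'gender': [np.nan, 'women', np.nan, np.nan],
    'title': ['lipbalm', 'lipstick', 'lipgloss', 'shampoo'],
})

# map() turns True into 'women' and anything not in the dict (False) into NaN,
# so fillna() has nothing to write into rows that do not match.
fill_values = df['title'].str.contains('lip').map({True: 'women'})
df['gender'] = df['gender'].fillna(fill_values)

print(df)
```

With this variant the 'shampoo' row keeps its missing gender instead of receiving False.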
Some things to note here -
Calling apply over 4 columns is wasteful when only two of them are used.
apply is wasteful and inefficient in general, because it is slow, uses a lot of memory, and offers no vectorisation benefits to you.
Inside apply you are dealing with scalars, so you do not use the .str accessor as you would on a pd.Series object. A plain Python check, "lip" in title, is enough.
gender.isnull fails because gender is a scalar and has no isnull attribute; pd.isnull(gender) works on scalars.
Option 1
np.where
m = df.gender.isnull() & df.title.str.contains('lip')
df['gender'] = np.where(m, 'women', df.gender)
df
        category gender sub-category     title
0  health&beauty  women       makeup   lipbalm
1  health&beauty  women       makeup  lipstick
2            NaN  women          NaN  lipgloss
Which is not only fast, but simpler as well. If you're worried about case sensitivity, you can make your contains check case insensitive -
m = df.gender.isnull() & df.title.str.contains('lip', flags=re.IGNORECASE)
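Note that the flags version needs import re. As an alternative sketch, str.contains also accepts case=False, and na=False makes missing titles count as non-matches. The sample frame below is made up for illustration and is not the asker's data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with mixed-case and missing titles.
df = pd.DataFrame({
    'gender': [np.nan, 'women', np.nan, np.nan],
    'title': ['LipBalm', 'lipstick', 'LIPGLOSS', np.nan],
})

# case=False ignores case; na=False treats a missing title as "no match",
# so the mask stays strictly boolean.
m = df['gender'].isnull() & df['title'].str.contains('lip', case=False, na=False)
df['gender'] = np.where(m, 'women', df['gender'])

print(df)
```

The last row has no title, so its gender is left missing.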
Option 2
Another alternative is using pd.Series.mask/pd.Series.where -
df['gender'] = df.gender.mask(m, 'women')
Or,
df['gender'] = df.gender.where(~m, 'women')
df
        category gender sub-category     title
0  health&beauty  women       makeup   lipbalm
1  health&beauty  women       makeup  lipstick
2            NaN  women          NaN  lipgloss
The mask implicitly applies the new value to the column based on the mask provided.
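If a row-wise function is still preferred, the notes above could be sketched as follows. This is an illustration rather than a recommendation, since the vectorised options are faster. It assumes the row is accessed by column label instead of by position, and the sample frame is reconstructed from the question:

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the table in the question.
df = pd.DataFrame({
    'category': ['health&beauty', 'health&beauty', np.nan],
    'gender': [np.nan, 'women', np.nan],
    'sub-category': ['makeup', 'makeup', np.nan],
    'title': ['lipbalm', 'lipstick', 'lipgloss'],
})

def impute_gender(row):
    # Inside apply(axis=1) each field is a scalar, so use pd.isnull()
    # and a plain substring test instead of the .str accessor.
    if pd.isnull(row['gender']) and 'lip' in str(row['title']).lower():
        return 'women'
    # Return the existing value so non-matching rows are not set to None.
    return row['gender']

# Only the two columns the function actually reads are passed in.
df['gender'] = df[['gender', 'title']].apply(impute_gender, axis=1)
print(df)
```

Returning the existing gender in the fallback branch matters: the original function returned nothing for non-matching rows, which would overwrite known values with None.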