How should I convert NaN values into a categorical value based on a condition? I am getting an error while trying to convert the NaN values.
category       gender  sub-category  title
health&beauty  NaN     makeup        lipbalm
health&beauty  women   makeup        lipstick
NaN            NaN     NaN           lipgloss
My DataFrame looks like the above. My function to convert NaN values in gender to a categorical value looks like this:
def impute_gender(cols):
    category=cols[0]
    sub_category=cols[2]
    gender=cols[1]
    title=cols[3]
    if title.str.contains('Lip') and gender.isnull==True:
        return 'women'

df[['category','gender','sub_category','title']].apply(impute_gender,axis=1)
When I run the code, I get this error:
----> 7 if title.str.contains('Lip') and gender.isnull()==True:
8 print(gender)
9
AttributeError: ("'str' object has no attribute 'str'", 'occurred at index category')
Complete dataset: https://github.com/lakshmipriya04/py-sample
Or simply use loc, as an Option 3 to @COLDSPEED's answer:
cond = (df['gender'].isnull()) & (df['title'].str.contains('lip'))
df.loc[cond, 'gender'] = 'women'
        category gender sub-category     title
0  health&beauty  women       makeup   lipbalm
1  health&beauty  women       makeup  lipstick
2            NaN  women          NaN  lipgloss
If we are dealing with NaN values, fillna can be one of the methods :-)
df.gender=df.gender.fillna(df.title.str.contains('lip').replace(True,'women'))
df
Out[63]:
        category gender sub-category     title
0  health&beauty  women       makeup   lipbalm
1  health&beauty  women       makeup  lipstick
2            NaN  women          NaN  lipgloss
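One caveat with the fillna one-liner: for titles that do not contain 'lip', str.contains returns False, and replace(True, 'women') leaves that False in place, so a missing gender on such a row would be filled with False instead of staying NaN. A minimal sketch of a variant that avoids this is below. The fourth row ("shampoo") is made up for illustration and is not in the original data; mapping with a dict is one possible way to do it, not the only one.

```python
import numpy as np
import pandas as pd

# Sample data from the question, plus one hypothetical row ("shampoo")
# added only to show what happens when the title does not match.
df = pd.DataFrame({
    "gender": [np.nan, "women", np.nan, np.nan],
    "title": ["lipbalm", "lipstick", "lipgloss", "shampoo"],
})

# True -> 'women'; False is not a key in the dict, so it maps to NaN.
fill = df["title"].str.contains("lip").map({True: "women"})

# fillna aligns on the index and only touches rows where gender is missing.
df["gender"] = df["gender"].fillna(fill)
```

With this variant the "shampoo" row keeps a missing gender rather than receiving False.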
Some things to note here -
If you only use two columns, calling apply over 4 columns is wasteful.
In general, apply is wasteful and inefficient, because it is slow, uses a lot of memory, and offers no vectorisation benefits to you.
Inside apply you are dealing with scalars, so you cannot use the .str accessor as you would on a pd.Series object. A plain string has no contains method either; the pythonic check is "lip" in title.
gender.isnull sounds completely wrong to the interpreter because gender is a scalar; it has no isnull attribute. Use pd.isnull(gender) instead.
Option 1
np.where
import numpy as np

m = df.gender.isnull() & df.title.str.contains('lip')
df['gender'] = np.where(m, 'women', df.gender)
df
        category gender sub-category     title
0  health&beauty  women       makeup   lipbalm
1  health&beauty  women       makeup  lipstick
2            NaN  women          NaN  lipgloss
This is not only fast, but simpler as well. If you're worried about case sensitivity, you can make your contains check case-insensitive -
import re

m = df.gender.isnull() & df.title.str.contains('lip', flags=re.IGNORECASE)
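As a self-contained sketch of the case-insensitive check: str.contains also accepts case=False, which should behave the same as flags=re.IGNORECASE here, and na=False makes a missing title count as a non-match. The mixed-case titles and the missing title below are made up for illustration and are not from the original dataset.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical titles with mixed case and one missing value.
df = pd.DataFrame({
    "gender": [np.nan, "women", np.nan, np.nan],
    "title": ["LipBalm", "lipstick", "LIPGLOSS", np.nan],
})

# Two ways to ignore case; na=False treats a missing title as "no match".
by_flag = df["title"].str.contains("lip", flags=re.IGNORECASE, na=False)
by_case = df["title"].str.contains("lip", case=False, na=False)

# Fill gender only where it is missing and the title matches.
m = df["gender"].isnull() & by_case
df["gender"] = np.where(m, "women", df["gender"])
```

The last row has neither a gender nor a title, so it is left as NaN.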
Option 2
Another alternative is using pd.Series.mask/pd.Series.where -
df['gender'] = df.gender.mask(m, 'women')
Or,
df['gender'] = df.gender.where(~m, 'women')
df
        category gender sub-category     title
0  health&beauty  women       makeup   lipbalm
1  health&beauty  women       makeup  lipstick
2            NaN  women          NaN  lipgloss
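The mask method applies the new value to the column wherever the boolean mask m is True; where does the opposite and keeps the existing values wherever its condition is True, which is why it is called with ~m.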
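If you do want to keep a row-wise function like the one in the question, the notes about scalars suggest something along these lines. This is only a sketch for comparison, since the vectorised options are preferable. It passes just the two columns the function needs, and the isinstance guard and the lower() call are additions of mine, not part of the original function.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "category": ["health&beauty", "health&beauty", np.nan],
    "gender": [np.nan, "women", np.nan],
    "sub-category": ["makeup", "makeup", np.nan],
    "title": ["lipbalm", "lipstick", "lipgloss"],
})

def impute_gender(row):
    gender, title = row["gender"], row["title"]
    # With axis=1 these are scalars, so use pd.isnull() and the `in` operator.
    if pd.isnull(gender) and isinstance(title, str) and "lip" in title.lower():
        return "women"
    # Otherwise return the existing value (or NaN) unchanged.
    return gender

df["gender"] = df[["gender", "title"]].apply(impute_gender, axis=1)
```

Returning gender in the fall-through branch matters: the function in the question returns None for every row that does not match, which would overwrite existing values.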