Suppose I have a dataframe,
data
id URL
1 www.pandora.com
2 m.jcpenney.com
3 www.youtube.com
4 www.facebook.com
I want to create a new column whose value depends on whether the URL contains a particular word. For example, if the URL contains 'youtube', the column value should be 'Youtube'. So I tried the following,
data['test'] = 'other'
Once we do that, we have,
data['test']
other
other
other
other
Then I tried this,
data[data['URL'].str.contains("youtub") == True]['test'] = 'Youtube'
data[data['URL'].str.contains("face") == True]['test'] = 'Facebook'
Although this runs without any error, the values in the test column don't change: every row still has 'other'. I expected the 3rd row to change to 'Youtube' and the 4th to 'Facebook', but neither does. Can anybody tell me what mistake I am making here?
I think you can use loc with a boolean mask created by contains:
print(data['URL'].str.contains("youtub"))
0 False
1 False
2 True
3 False
Name: URL, dtype: bool
data.loc[data['URL'].str.contains("youtub"),'test'] = 'Youtube'
data.loc[data['URL'].str.contains("face"),'test'] = 'Facebook'
print(data)
id URL test
0 1 www.pandora.com NaN
1 2 m.jcpenney.com NaN
2 3 www.youtube.com Youtube
3 4 www.facebook.com Facebook
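For completeness, here is a self-contained sketch of the loc approach on the sample data from the question. The frame is rebuilt by hand, the names is_youtube and is_facebook are just illustrative, and regex=False is my addition so the search terms are treated as plain substrings. As far as I understand it, the original attempt fails because data[mask] returns a copy, so the ['test'] = ... assignment lands on that temporary copy and never reaches data, whereas loc selects the rows and the column in a single step on the original frame.

```python
import pandas as pd

# Sample data rebuilt from the question
data = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "URL": ["www.pandora.com", "m.jcpenney.com",
            "www.youtube.com", "www.facebook.com"],
})

# Default label for every row, as in the question
data["test"] = "other"

# Boolean masks; regex=False (an assumption) matches plain substrings
is_youtube = data["URL"].str.contains("youtub", regex=False)
is_facebook = data["URL"].str.contains("face", regex=False)

# Row mask and column label go into one .loc call,
# so the assignment is applied to `data` itself, not to a copy
data.loc[is_youtube, "test"] = "Youtube"
data.loc[is_facebook, "test"] = "Facebook"

print(data)
```

Because test is initialised to 'other' first, the unmatched rows keep that value instead of the NaN shown in the output above.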
I would do it in one shot:
replacements = {
r'.*youtube.*': 'Youtube',
r'.*face.*': 'Facebook',
r'.*pandora.*': 'Pandora'
}
df['text'] = df.URL.replace(replacements, regex=True)
# any value that still contains a dot was not matched above, so label it 'other'
df.loc[df.text.str.contains(r'\.'), 'text'] = 'other'
print(df)
Output:
URL text
id
1 www.pandora.com Pandora
2 m.jcpenney.com other
3 www.youtube.com Youtube
4 www.facebook.com Facebook
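Note that the last step above relies on every unmatched URL still containing a dot. If you would rather not depend on that, one possible variation (not the method above, just a sketch that combines the same mapping idea with loc) is to start from 'other' and loop over a keyword-to-label mapping. The labels dict, the loop, and regex=False are my own illustrative choices, and the frame is rebuilt by hand from the sample data.

```python
import pandas as pd

# Sample data rebuilt from the question, with id as the index
df = pd.DataFrame(
    {"URL": ["www.pandora.com", "m.jcpenney.com",
             "www.youtube.com", "www.facebook.com"]},
    index=pd.Index([1, 2, 3, 4], name="id"),
)

# Hypothetical keyword -> label mapping (plain substrings, not regexes)
labels = {
    "youtube": "Youtube",
    "face": "Facebook",
    "pandora": "Pandora",
}

# Start with the default label, then overwrite the rows that match a keyword
df["text"] = "other"
for keyword, label in labels.items():
    mask = df["URL"].str.contains(keyword, regex=False)
    df.loc[mask, "text"] = label

print(df)
```

If a URL matches more than one keyword, the keyword that comes later in the dict wins, so order the mapping accordingly.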