I have a .xlsx file that I am opening with this code:
import pandas as pd
df = pd.read_excel(open('file.xlsx','rb'))
df['Description'].head()
and I have the following result, which looks pretty good.
ID | Description
:----- | :-----------------------------
0 | Some Description with no hash
1 | Text with #one hash
2 | Text with #two #hashes
Now I want to create a new column that keeps only the words starting with #, like this:
ID | Description | Only_Hash
:----- | :----------------------------- | :-----------------
0 | Some Description with no hash | NaN
1 | Text with #one hash | #one
2 | Text with #two #hashes | #two #hashes
I was able to count the rows that contain a #:
descriptionWithHash = df['Description'].str.contains('#').sum()
but now I want to create the column like I described above. What is the easiest way to do that?
Regards!
PS: it is supposed to show a table format in the question but I can't figure out why it is showing wrong!
You can use str.findall with str.join:
df['new'] = df['Description'].str.findall(r'#\w+').str.join(' ')
print(df)
   ID                    Description           new
0   0  Some Description with no hash
1   1            Text with #one hash          #one
2   2         Text with #two #hashes  #two #hashes
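To see why the two calls are chained, here is a minimal sketch of the intermediate values. The DataFrame is built in memory as a stand-in for the Excel file from the question, and the variable names `matches` and `joined` are only illustrative.

```python
import pandas as pd

# Sample data standing in for the Description column read from file.xlsx
df = pd.DataFrame({
    "Description": [
        "Some Description with no hash",
        "Text with #one hash",
        "Text with #two #hashes",
    ]
})

# Step 1: findall returns one list of matches per row (an empty list if none)
matches = df["Description"].str.findall(r"#\w+")
print(matches.tolist())

# Step 2: join collapses each list into a single space-separated string
joined = matches.str.join(" ")
print(joined.tolist())
```

Rows without any hashtag end up as an empty string after the join, which is why that row looks blank in the printed output.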
And to get NaN instead of an empty string for rows without a hash:
import numpy as np
df['new'] = df['Description'].str.findall(r'#\w+').str.join(' ').replace('', np.nan)
print(df)
   ID                    Description           new
0   0  Some Description with no hash           NaN
1   1            Text with #one hash          #one
2   2         Text with #two #hashes  #two #hashes
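Putting it together, the whole thing could be wrapped in a small helper. This is only a sketch: `add_hash_column` and its parameter names are hypothetical, the target column is named `Only_Hash` to match the question, and the sample DataFrame again stands in for the Excel file.

```python
import numpy as np
import pandas as pd


def add_hash_column(df, source="Description", target="Only_Hash"):
    """Return a copy of df with a column holding only the #words from `source`."""
    out = df.copy()
    # Collect every #word per row and join them with single spaces
    tags = out[source].str.findall(r"#\w+").str.join(" ")
    # Rows without any #word come back as "", so map those to NaN
    out[target] = tags.replace("", np.nan)
    return out


# Sample data standing in for the DataFrame read from file.xlsx
df = pd.DataFrame({
    "Description": [
        "Some Description with no hash",
        "Text with #one hash",
        "Text with #two #hashes",
    ]
})

result = add_hash_column(df)
print(result)
```

Working on a copy leaves the original DataFrame untouched; if you would rather modify it in place, assign the column directly on `df` as in the one-liners above.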