Extract substring from text in a pandas DataFrame as new column

I have a list of words that I want to search for:

word_list = ['one','three']

I also have a text column in a pandas DataFrame:

TEXT                                       |
-------------------------------------------|
"Perhaps she'll be the one for me."        |
"Is it two or one?"                        |
"Mayhaps it be three afterall..."          |
"Three times and it's a charm."            |
"One fish, two fish, red fish, blue fish." |
"There's only one cat in the hat."         |
"One does not simply code into pandas."    |
"Two nights later..."                      |
"Quoth the Raven... nevermore."            |

The desired output is shown below: it keeps the original text column and adds a new column containing only the word from word_list found in each row.

TEXT                                       | EXTRACT
-------------------------------------------|---------------
"Perhaps she'll be the one for me."        | one
"Is it two or one?"                        | one
"Mayhaps it be three afterall..."          | three
"Three times and it's a charm."            | three
"One fish, two fish, red fish, blue fish." | one
"There's only one cat in the hat."         | one
"One does not simply code into pandas."    | one
"Two nights later..."                      | 
"Quoth the Raven... nevermore."            |

Is there a way to do this in Python 2.7?

Leggerless asked Oct 24 '17 23:10


1 Answer

Use str.extract:

import re

# Join the words into one alternation pattern, e.g. '(one|three)'
pattern = '({})'.format('|'.join(word_list))
df['EXTRACT'] = df.TEXT.str.extract(pattern, flags=re.IGNORECASE, expand=False).str.lower().fillna('')
df['EXTRACT']

0      one
1      one
2    three
3    three
4      one
5      one
6      one
7         
8         
Name: EXTRACT, dtype: object
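
To reproduce this output from scratch, here is a minimal self-contained sketch. It assumes the column is named TEXT and rebuilds the DataFrame from the rows shown in the question; the name pattern is only an illustrative variable.

```python
import re

import pandas as pd

word_list = ['one', 'three']

# Rebuild the sample data from the question (column name TEXT is assumed).
df = pd.DataFrame({'TEXT': [
    "Perhaps she'll be the one for me.",
    "Is it two or one?",
    "Mayhaps it be three afterall...",
    "Three times and it's a charm.",
    "One fish, two fish, red fish, blue fish.",
    "There's only one cat in the hat.",
    "One does not simply code into pandas.",
    "Two nights later...",
    "Quoth the Raven... nevermore.",
]})

# '(one|three)': a single capture group holding the alternatives.
pattern = '({})'.format('|'.join(word_list))

df['EXTRACT'] = (
    df['TEXT']
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)  # first match per row
    .str.lower()   # normalise 'One' / 'Three' to lowercase
    .fillna('')    # rows without a match become empty strings
)
print(df['EXTRACT'].tolist())
```

This should run on both Python 2.7 and Python 3, since it only uses str.format and a single-argument print.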

Each word in word_list is joined with the regex alternation operator |, wrapped in a capture group, and passed to str.extract, which returns the first match in each row.

The re.IGNORECASE flag makes the match case-insensitive, and the resulting matches are lowercased to agree with your expected output. Rows with no match come back as NaN, so fillna('') replaces them with empty strings.
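
One caveat with a plain alternation is that it also matches inside longer words, so 'one' would be found in "Someone". If that is not wanted, a possible variant (an assumption about your requirements, not part of the original answer) is to escape each word and wrap the group in \b word boundaries. The demo Series below is made-up sample data.

```python
import re

import pandas as pd

word_list = ['one', 'three']

# Escape each word (in case it contains regex metacharacters) and require
# word boundaries so 'one' does not match inside 'Someone'.
pattern = r'\b({})\b'.format('|'.join(re.escape(w) for w in word_list))

# Hypothetical sample data, used only to illustrate the difference.
demo = pd.Series(['Someone called.', 'Only one left.', 'Three times.'])

extracted = demo.str.extract(pattern, flags=re.IGNORECASE, expand=False)
extracted = extracted.str.lower().fillna('')

print(pattern)
print(extracted.tolist())
```

Here the first row yields an empty string because "Someone" contains 'one' only as part of a larger word.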

cs95 answered Sep 29 '22 14:09