Extract substring from text in a pandas DataFrame as new column

Tags: python, string, regex, pandas, extract

I have a list of words I want to look for:

word_list = ['one','three']

And I have a text column in a pandas DataFrame:

TEXT                                       |
-------------------------------------------|
"Perhaps she'll be the one for me."        |
"Is it two or one?"                        |
"Mayhaps it be three afterall..."          |
"Three times and it's a charm."            |
"One fish, two fish, red fish, blue fish." |
"There's only one cat in the hat."         |
"One does not simply code into pandas."    |
"Two nights later..."                      |
"Quoth the Raven... nevermore."            |

The desired output is below: it keeps the original text column and adds a new column containing only the word from word_list found in each row.

TEXT                                       | EXTRACT
-------------------------------------------|---------------
"Perhaps she'll be the one for me."        | one
"Is it two or one?"                        | one
"Mayhaps it be three afterall..."          | three
"Three times and it's a charm."            | three
"One fish, two fish, red fish, blue fish." | one
"There's only one cat in the hat."         | one
"One does not simply code into pandas."    | one
"Two nights later..."                      | 
"Quoth the Raven... nevermore."            |

Is there a way to do this in Python 2.7?

290

asked Oct 24 '17 23:10

Leggerless

1 Answer

Use str.extract:

import re  # needed for re.IGNORECASE

df['EXTRACT'] = df.TEXT.str.extract('({})'.format('|'.join(word_list)),
                        flags=re.IGNORECASE, expand=False).str.lower().fillna('')
df['EXTRACT']

0      one
1      one
2    three
3    three
4      one
5      one
6      one
7         
8         
Name: EXTRACT, dtype: object

Each word in word_list is joined with the regex alternation operator |, and the resulting pattern is passed to str.extract, which returns the first match in each row (NaN where nothing matches).
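
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt from a few of the rows in the question, and the re.escape call is my own addition (not part of the original one-liner), on the assumption that a word list might someday contain regex metacharacters such as "." or "+".

```python
import re
import pandas as pd

word_list = ['one', 'three']
df = pd.DataFrame({'TEXT': [
    "Perhaps she'll be the one for me.",
    "Is it two or one?",
    "Mayhaps it be three afterall...",
    "Three times and it's a charm.",
    "Two nights later...",
]})

# Escape each word so it is matched literally (an added precaution),
# join the words with | and wrap them in a single capture group,
# which is what str.extract needs in order to return a value.
pattern = '({})'.format('|'.join(re.escape(w) for w in word_list))
print(pattern)  # (one|three)

df['EXTRACT'] = (
    df['TEXT']
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)  # first match or NaN
    .str.lower()   # normalise "Three" to "three"
    .fillna('')    # rows without a match become empty strings
)
print(df['EXTRACT'].tolist())  # ['one', 'one', 'three', 'three', '']
```

With expand=False and a single capture group, str.extract returns a Series rather than a one-column DataFrame, which is why it can be assigned straight to the new column.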

The re.IGNORECASE flag makes the match case-insensitive, the matches are lowercased to agree with your expected output, and fillna('') replaces the missing values with empty strings.
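
To see what the flag changes, here is a small sketch on two made-up sentences (the second one is hypothetical, not from the question). It also shows a caveat: the plain pattern matches inside longer words such as "Someone". If that matters for your data, wrapping the group in \b word boundaries is one possible way to require whole words; that variant is my suggestion, not part of the answer above.

```python
import re
import pandas as pd

word_list = ['one', 'three']
texts = pd.Series(["Three times and it's a charm.", "Someone called twice."])
pattern = '({})'.format('|'.join(word_list))

# Without the flag, the capitalised "Three" is missed,
# while "one" is still found inside "Someone".
case_sensitive = texts.str.extract(pattern, expand=False)

# With the flag, "Three" matches and keeps its original casing
# until .str.lower() is applied.
case_insensitive = texts.str.extract(pattern, flags=re.IGNORECASE, expand=False)

# Optional variant: \b on both sides only accepts whole words,
# so "Someone" no longer yields a match.
bounded_pattern = r'\b({})\b'.format('|'.join(word_list))
whole_words = texts.str.extract(bounded_pattern, flags=re.IGNORECASE, expand=False)

print(case_insensitive.tolist())  # ['Three', 'one']
```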

142

answered Sep 29 '22 14:09

cs95
