Conditional word frequency count in Pandas

I have a dataframe like below:

import pandas as pd

data = {'speaker':['Adam','Ben','Clair'],
        'speech': ['Thank you very much and good afternoon.',
                   'Let me clarify that because I want to make sure we have got everything right',
                   'By now you should have some good rest']}
df = pd.DataFrame(data)

I want to count the number of words in the speech column but only for the words from a pre-defined list. For example, the list is:

wordlist = ['much', 'good','right']

I want to generate a new column which shows the frequency of these three words in each row. My expected output is therefore:

  speaker                                             speech  words
0    Adam            Thank you very much and good afternoon.      2
1     Ben  Let me clarify that because I want to make sur...      1
2   Clair              By now you should have some good rest      1

I tried:

df['total'] = 0
for word in df['speech'].str.split():
    if word in wordlist: 
        df['total'] += 1

But after running it, the total column is always zero. What is wrong with my code?

asked Jan 30 '20 15:01

Tao Han

2 Answers

You could use the following vectorised approach:

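import re
import pandas as pd

data = {'speaker':['Adam','Ben','Clair'],
        'speech': ['Thank you very much and good afternoon.',
                   'Let me clarify that because I want to make sure we have got everything right',
                   'By now you should have some good rest']}
df = pd.DataFrame(data)

wordlist = ['much', 'good','right']

# Group the alternation so that \b guards both ends of every word;
# joining with r'\b|\b' alone leaves the first and last word unanchored.
pattern = r'\b(?:' + '|'.join(map(re.escape, wordlist)) + r')\b'
df['total'] = df['speech'].str.count(pattern)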

Which gives:

>>> df
  speaker                                             speech  total
0    Adam            Thank you very much and good afternoon.      2
1     Ben  Let me clarify that because I want to make sur...      1
2   Clair              By now you should have some good rest      1
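
As for why the loop in the question leaves the column at zero: iterating over df['speech'].str.split() yields one list of words per row, not individual words, so `word in wordlist` compares a whole list against the strings in wordlist and is never true. A minimal sketch of a row-by-row version that does work, for comparison with the vectorised call above. The helper name count_listed_words is hypothetical, and it matches whole whitespace-separated tokens only, so a token with trailing punctuation such as 'good.' would not be counted:

```python
import pandas as pd

data = {'speaker': ['Adam', 'Ben', 'Clair'],
        'speech': ['Thank you very much and good afternoon.',
                   'Let me clarify that because I want to make sure we have got everything right',
                   'By now you should have some good rest']}
df = pd.DataFrame(data)
wordlist = ['much', 'good', 'right']

# Each element of the split Series is a list of tokens, not a single word.
first_row_tokens = df['speech'].str.split().iloc[0]
list_in_wordlist = first_row_tokens in wordlist  # a list never equals a string

def count_listed_words(text, words):
    # Hypothetical helper: count the tokens of one speech that appear in `words`.
    lookup = set(words)
    return sum(1 for token in text.split() if token in lookup)

# Build the column from per-row counts instead of incrementing the whole column.
df['total'] = [count_listed_words(text, wordlist) for text in df['speech']]
```

Note also that `df['total'] += 1` in the original loop would add 1 to every row at once, so even a true condition would not have produced per-row counts.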

answered Oct 22 '22 17:10

CDJB

This can be a much faster solution at runtime if you have a very large word list and a large data frame to search through.

I guess that is because it takes advantage of a dictionary, which takes O(N) to construct and O(1) per lookup. Performance-wise, a regex search is slower.

import pandas as pd
from collections import Counter

def occurrence_counter(target_string, search_list):
    # Count every whitespace-separated token in one pass over the string
    counts = Counter(target_string.split())
    # Counter returns 0 for missing keys, so no membership check is needed
    return sum(counts[key] for key in search_list)

data = {'speaker':['Adam','Ben','Clair'],
        'speech': ['Thank you very much and good afternoon.',
                   'Let me clarify that because I want to make sure we have got everything right',
                   'By now you should have some good rest']}
df = pd.DataFrame(data)

wordlist = ['much', 'good','right']

# Store the per-row counts in a new column
df['total'] = df['speech'].apply(lambda x: occurrence_counter(x, wordlist))
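
One caveat with str.split(): it splits on whitespace only, so punctuation stays attached to tokens ('right.' would not match 'right') and matching is case-sensitive. A hedged sketch of a variant that lowercases the text and tokenizes with re.findall first. The function name occurrence_counter_normalized and the sample sentence are hypothetical, and the `\w+` tokenization is an assumption that may not suit every text (it splits contractions such as "don't", for instance):

```python
import re
from collections import Counter

def occurrence_counter_normalized(target_string, search_list):
    # Lowercase, then keep runs of word characters so punctuation is dropped.
    tokens = re.findall(r"\w+", target_string.lower())
    counts = Counter(tokens)
    # Counter returns 0 for missing keys, so no membership check is needed.
    return sum(counts[word.lower()] for word in search_list)

wordlist = ['much', 'good', 'right']
sample = 'Good, that sounds right. Very good!'  # hypothetical sample sentence
result = occurrence_counter_normalized(sample, wordlist)
print(result)
```

It can be passed to df['speech'].apply in the same way as occurrence_counter.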

answered Oct 22 '22 17:10

Jinhua Wang

Tags:

python

string

pandas

dataframe

nlp