
Counting the Frequency of words in a pandas data frame

I have a table like below:

      URN                   Firm_Name
0  104472               R.X. Yah & Co
1  104873        Big Building Society
2  109986          St James's Society
3  114058  The Kensington Society Ltd
4  113438      MMV Oil Associates Ltd
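For reference, a reproducible version of this sample (only a sketch; the real data is read from a CSV, so the constructor below is purely illustrative):

import pandas as pd

# illustrative reconstruction of the sample table above
data = pd.DataFrame({
    'URN': [104472, 104873, 109986, 114058, 113438],
    'Firm_Name': ['R.X. Yah & Co', 'Big Building Society',
                  "St James's Society", 'The Kensington Society Ltd',
                  'MMV Oil Associates Ltd'],
})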

And I want to count the frequency of all the words within the Firm_Name column, to get an output like below:

[image of the desired output: a two-column table listing each word from Firm_Name and its frequency]

I have tried the following code:

import pandas as pd
import nltk

data = pd.read_csv("X:\Firm_Data.csv")

top_N = 20
word_dist = nltk.FreqDist(data['Firm_Name'])
print('All frequencies')
print('='*60)
rslt = pd.DataFrame(word_dist.most_common(top_N), columns=['Word', 'Frequency'])

print(rslt)
print('='*60)

However, this code does not produce a count of the individual words.
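The likely reason (a sketch using the illustrative frame above): nltk.FreqDist counts whatever items it is handed, so passing the Firm_Name Series counts each complete firm name once instead of counting the individual words.

import nltk

# FreqDist over the raw column treats each full firm name as one item
word_dist = nltk.FreqDist(data['Firm_Name'])
print(word_dist.most_common(3))
# [('R.X. Yah & Co', 1), ('Big Building Society', 1), ("St James's Society", 1)]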

asked Oct 17 '17 by J Reza




2 Answers

IIUC, use value_counts():

In [3361]: df.Firm_Name.str.split(expand=True).stack().value_counts()
Out[3361]:
Society       3
Ltd           2
James's       1
R.X.          1
Yah           1
Associates    1
St            1
Kensington    1
MMV           1
Big           1
&             1
The           1
Co            1
Oil           1
Building      1
dtype: int64
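To unpack that chain (a sketch, assuming the df shown under Details below):

# split each firm name into word columns; shorter names are padded with None
words_wide = df.Firm_Name.str.split(expand=True)

# stack the word columns into one long Series, dropping the padding
words_long = words_wide.stack()

# tally each distinct word
counts = words_long.value_counts()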

Or,

import numpy as np

pd.Series(np.concatenate([x.split() for x in df.Firm_Name])).value_counts()

Or,

pd.Series(' '.join(df.Firm_Name).split()).value_counts() 

For the top N, for example N = 3:

In [3379]: pd.Series(' '.join(df.Firm_Name).split()).value_counts()[:3]
Out[3379]:
Society    3
Ltd        2
James's    1
dtype: int64
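If you want the result shaped like the two-column table asked for in the question, the counts Series can be converted to a DataFrame (the column names here are just illustrative):

counts = pd.Series(' '.join(df.Firm_Name).split()).value_counts()

# move the words out of the index into a regular column
rslt = counts.rename_axis('Word').reset_index(name='Frequency')
print(rslt.head(3))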

Details

In [3380]: df
Out[3380]:
      URN                   Firm_Name
0  104472               R.X. Yah & Co
1  104873        Big Building Society
2  109986          St James's Society
3  114058  The Kensington Society Ltd
4  113438      MMV Oil Associates Ltd
answered Sep 26 '22 by Zero

You can use str.cat with str.lower to concatenate all values into one string, then word_tokenize, and finally apply your FreqDist solution:

top_N = 4
# lowercase first; skip if not necessary
a = data['Firm_Name'].str.lower().str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(a)
word_dist = nltk.FreqDist(words)
print(word_dist)
<FreqDist with 17 samples and 20 outcomes>

rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
      Word  Frequency
0  society          3
1      ltd          2
2      the          1
3       co          1

It is also possible to omit the lowercasing if you want to keep the original case:

top_N = 4
a = data['Firm_Name'].str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(a)
word_dist = nltk.FreqDist(words)
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
         Word  Frequency
0     Society          3
1         Ltd          2
2         MMV          1
3  Kensington          1
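One caveat: word_tokenize relies on NLTK's punkt tokenizer data; if it is not already installed, download it once before running the above:

import nltk

# one-time download of the tokenizer models used by word_tokenize
nltk.download('punkt')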
answered Sep 23 '22 by jezrael