Counting the Frequency of words in a pandas data frame

Tags: python, pandas, nltk

I have a table like below:

          URN                   Firm_Name
    0  104472               R.X. Yah & Co
    1  104873        Big Building Society
    2  109986          St James's Society
    3  114058  The Kensington Society Ltd
    4  113438      MMV Oil Associates Ltd

And I want to count the frequency of all the words within the Firm_Name column, to get an output like below:

[image: https://i.stack.imgur.com/TbW0I.png]

I have tried the following code:

    import pandas as pd
    import nltk

    data = pd.read_csv("X:\Firm_Data.csv")

    top_N = 20
    word_dist = nltk.FreqDist(data['Firm_Name'])
    print('All frequencies')
    print('='*60)
    rslt=pd.DataFrame(word_dist.most_common(top_N),columns=['Word','Frequency'])

    print(rslt)
    print ('='*60)

However, the code above does not produce a count of unique words.

asked Oct 17 '17 08:10

J Reza

2 Answers

IIUC, use value_counts():

    In [3361]: df.Firm_Name.str.split(expand=True).stack().value_counts()
    Out[3361]:
    Society       3
    Ltd           2
    James's       1
    R.X.          1
    Yah           1
    Associates    1
    St            1
    Kensington    1
    MMV           1
    Big           1
    &             1
    The           1
    Co            1
    Oil           1
    Building      1
    dtype: int64

Or, with NumPy:

    import numpy as np
    pd.Series(np.concatenate([x.split() for x in df.Firm_Name])).value_counts()

Or,

    pd.Series(' '.join(df.Firm_Name).split()).value_counts()

For the top N, for example 3:

    In [3379]: pd.Series(' '.join(df.Firm_Name).split()).value_counts()[:3]
    Out[3379]:
    Society    3
    Ltd        2
    James's    1
    dtype: int64

Details

    In [3380]: df
    Out[3380]:
          URN                   Firm_Name
    0  104472               R.X. Yah & Co
    1  104873        Big Building Society
    2  109986          St James's Society
    3  114058  The Kensington Society Ltd
    4  113438      MMV Oil Associates Ltd
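For anyone who wants to run this without the CSV, here is a minimal self-contained sketch. It rebuilds the sample frame from the question, checks that the split/stack variant and the join/split variant give the same counts, and shapes the top N into the Word/Frequency table the question asks for. The names `counts_stack`, `counts_join`, `top_n` and `rslt` are my own choices for illustration. Words tied on the same count may come out in a different order depending on the pandas version.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "URN": [104472, 104873, 109986, 114058, 113438],
    "Firm_Name": [
        "R.X. Yah & Co",
        "Big Building Society",
        "St James's Society",
        "The Kensington Society Ltd",
        "MMV Oil Associates Ltd",
    ],
})

# Variant 1: split each name into columns, stack into one long Series, count.
# value_counts() skips missing values, so padding from shorter names is ignored.
counts_stack = df["Firm_Name"].str.split(expand=True).stack().value_counts()

# Variant 2: join all names into one string, split on whitespace, count.
counts_join = pd.Series(" ".join(df["Firm_Name"]).split()).value_counts()

# Shape the top N like the table in the question
# (column names taken from the asker's own code).
top_n = 3
rslt = counts_join.head(top_n).rename_axis("Word").reset_index(name="Frequency")
print(rslt)
```

Both variants split on whitespace only, so punctuation stays attached to the word (for example "James's" and "R.X." are counted as single tokens).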

answered Sep 26 '22 05:09

Zero

You need str.cat (after lower) to concatenate all values into one string, then word_tokenize to split that string into words, and finally your own FreqDist solution:

    top_N = 4
    # lowercase first (optional) so that counting is case-insensitive
    a = data['Firm_Name'].str.lower().str.cat(sep=' ')
    words = nltk.tokenize.word_tokenize(a)
    word_dist = nltk.FreqDist(words)
    print (word_dist)
    <FreqDist with 17 samples and 20 outcomes>

    rslt = pd.DataFrame(word_dist.most_common(top_N),
                        columns=['Word', 'Frequency'])
    print(rslt)
          Word  Frequency
    0  society          3
    1      ltd          2
    2      the          1
    3       co          1

It is also possible to drop lower if the original case should be kept:

    top_N = 4
    a = data['Firm_Name'].str.cat(sep=' ')
    words = nltk.tokenize.word_tokenize(a)
    word_dist = nltk.FreqDist(words)
    rslt = pd.DataFrame(word_dist.most_common(top_N),
                        columns=['Word', 'Frequency'])
    print(rslt)
             Word  Frequency
    0     Society          3
    1         Ltd          2
    2         MMV          1
    3  Kensington          1
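To see why the joining and tokenizing step matters, here is a small standard-library sketch. It uses collections.Counter as a stand-in for nltk.FreqDist (FreqDist is a subclass of Counter), and plain str.split() as a stand-in for word_tokenize, so it runs without NLTK. The names `names`, `name_counts` and `word_counts` are illustrative only. Because str.split() does not separate punctuation the way word_tokenize does, the totals will not match the FreqDist summary above exactly.

```python
from collections import Counter

# Firm names copied from the question.
names = [
    "R.X. Yah & Co",
    "Big Building Society",
    "St James's Society",
    "The Kensington Society Ltd",
    "MMV Oil Associates Ltd",
]

# Counting the column's values directly treats each full name as one item,
# so every "word" is a whole firm name that occurs once.
name_counts = Counter(names)

# Joining into one string and splitting it first gives per-word counts.
# Assumption: whitespace splitting stands in for nltk's word_tokenize here.
word_counts = Counter(" ".join(names).lower().split())

print(name_counts.most_common(2))
print(word_counts.most_common(2))  # [('society', 3), ('ltd', 2)]
```

The first counter has five entries with a count of 1 each, which is the behaviour described in the question. The second one counts individual words.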

answered Sep 23 '22 05:09

jezrael
