I have a table like below:
          URN                   Firm_Name
    0  104472               R.X. Yah & Co
    1  104873        Big Building Society
    2  109986          St James's Society
    3  114058  The Kensington Society Ltd
    4  113438      MMV Oil Associates Ltd
And I want to count the frequency of all the words within the Firm_Name column, to get an output like below:
I have tried the following code:
    import pandas as pd
    import nltk

    # Raw string so the backslash in the Windows path is not read as an escape
    data = pd.read_csv(r"X:\Firm_Data.csv")
    top_N = 20
    word_dist = nltk.FreqDist(data['Firm_Name'])
    print('All frequencies')
    print('='*60)
    rslt = pd.DataFrame(word_dist.most_common(top_N), columns=['Word', 'Frequency'])
    print(rslt)
    print('='*60)
However, the code above does not produce a count for each unique word.
The count() and size() methods, Series.value_counts(), and pandas.Index.value_counts() can all be used to count how often values occur in a given DataFrame.
Series.str.count() counts the occurrences of a pattern in each string of a Series or Index, that is, how many times a given regex pattern matches within each string element. Its pat argument must be a valid regular expression; extra keyword arguments are accepted only for compatibility with other string methods.
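As a rough illustration of the per-row behaviour described above, the sketch below counts one word across the sample firm names. The Series name `firm_names` and the choice of the word "Society" are assumptions made for this example only.

```python
import pandas as pd

# Sample data copied from the question's table.
firm_names = pd.Series([
    "R.X. Yah & Co",
    "Big Building Society",
    "St James's Society",
    "The Kensington Society Ltd",
    "MMV Oil Associates Ltd",
])

# str.count returns, for each row, how many times the regex matches.
# \b keeps the match to the whole word rather than a substring.
society_per_row = firm_names.str.count(r"\bSociety\b")

# Summing the per-row counts gives the total for that single word.
society_total = int(society_per_row.sum())

print(society_per_row.tolist())
print(society_total)
```

This only answers "how often does one given word occur"; counting every word at once still needs one of the approaches in the answers.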
Use set() to drop duplicates and obtain the set of unique words. Then iterate over that set and call count() on the list of all words (for example words.count(word)) to get the frequency of each word. Counting on the word list rather than on the joined string avoids matching substrings of longer words.
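The set-and-count idea can be sketched in plain Python as follows. The list `firm_names` is the sample data from the question, and splitting on whitespace is an assumption about what counts as a word.

```python
# Sample data copied from the question's table.
firm_names = [
    "R.X. Yah & Co",
    "Big Building Society",
    "St James's Society",
    "The Kensington Society Ltd",
    "MMV Oil Associates Ltd",
]

# Flatten all firm names into one list of whitespace-separated words.
words = " ".join(firm_names).split()

# set() removes duplicates, leaving each distinct word once.
unique_words = set(words)

# list.count compares whole list items, so only exact words are counted.
frequency = {word: words.count(word) for word in unique_words}

print(frequency["Society"], frequency["Ltd"])
```

Each call to `words.count` scans the whole list, so this is quadratic in the number of words; it is fine for a small table but the pandas approaches scale better.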
IIUC, use value_counts():

    In [3361]: df.Firm_Name.str.split(expand=True).stack().value_counts()
    Out[3361]:
    Society       3
    Ltd           2
    James's       1
    R.X.          1
    Yah           1
    Associates    1
    St            1
    Kensington    1
    MMV           1
    Big           1
    &             1
    The           1
    Co            1
    Oil           1
    Building      1
    dtype: int64
Or,
pd.Series(np.concatenate([x.split() for x in df.Firm_Name])).value_counts()
Or,
pd.Series(' '.join(df.Firm_Name).split()).value_counts()
For the top N, for example 3:

    In [3379]: pd.Series(' '.join(df.Firm_Name).split()).value_counts()[:3]
    Out[3379]:
    Society    3
    Ltd        2
    James's    1
    dtype: int64
Details
    In [3380]: df
    Out[3380]:
          URN                   Firm_Name
    0  104472               R.X. Yah & Co
    1  104873        Big Building Society
    2  109986          St James's Society
    3  114058  The Kensington Society Ltd
    4  113438      MMV Oil Associates Ltd
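To try this without the CSV file, the frame can be rebuilt inline. This is a minimal sketch that assumes the five sample rows from the question; the variable names `word_counts` and `top_3` are introduced here for illustration.

```python
import pandas as pd

# Rebuild the sample frame from the question instead of reading a CSV.
df = pd.DataFrame({
    "URN": [104472, 104873, 109986, 114058, 113438],
    "Firm_Name": [
        "R.X. Yah & Co",
        "Big Building Society",
        "St James's Society",
        "The Kensington Society Ltd",
        "MMV Oil Associates Ltd",
    ],
})

# Split each name into one column per word, stack the columns into a
# single Series (missing cells are dropped), then count each value.
word_counts = df.Firm_Name.str.split(expand=True).stack().value_counts()

# value_counts sorts by frequency, so slicing gives the top N.
top_3 = word_counts[:3]
print(top_3)
```

Words that tie on frequency can come out in any order, so only the positions of "Society" and "Ltd" are determined for this data.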
You need str.cat, with lower applied first, to concatenate all values into one string; then tokenize that string with word_tokenize, and finally apply your own solution:
    top_N = 4
    # .str.lower() is optional; drop it if case should be preserved
    a = data['Firm_Name'].str.lower().str.cat(sep=' ')
    words = nltk.tokenize.word_tokenize(a)
    word_dist = nltk.FreqDist(words)
    print(word_dist)
    # <FreqDist with 17 samples and 20 outcomes>

    rslt = pd.DataFrame(word_dist.most_common(top_N),
                        columns=['Word', 'Frequency'])
    print(rslt)
    #       Word  Frequency
    # 0  society          3
    # 1      ltd          2
    # 2      the          1
    # 3       co          1
You can also drop lower if you want to keep the original case:
    top_N = 4
    a = data['Firm_Name'].str.cat(sep=' ')
    words = nltk.tokenize.word_tokenize(a)
    word_dist = nltk.FreqDist(words)
    rslt = pd.DataFrame(word_dist.most_common(top_N),
                        columns=['Word', 'Frequency'])
    print(rslt)
    #          Word  Frequency
    # 0     Society          3
    # 1         Ltd          2
    # 2         MMV          1
    # 3  Kensington          1
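If NLTK is not available, roughly the same table can be sketched with collections.Counter from the standard library, which also provides most_common. This swaps word_tokenize for a plain whitespace split, which is an assumption of this sketch: punctuation stays attached to words (for example "james's" is one token), so the counts can differ from the NLTK output above.

```python
from collections import Counter

import pandas as pd

# Sample frame standing in for the CSV from the question.
data = pd.DataFrame({
    "Firm_Name": [
        "R.X. Yah & Co",
        "Big Building Society",
        "St James's Society",
        "The Kensington Society Ltd",
        "MMV Oil Associates Ltd",
    ],
})

top_N = 4

# Lowercase, then join every firm name into one space-separated string.
text = data["Firm_Name"].str.lower().str.cat(sep=" ")

# Whitespace split instead of nltk.tokenize.word_tokenize.
words = text.split()

# Counter plays the role of nltk.FreqDist here.
word_dist = Counter(words)

rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=["Word", "Frequency"])
print(rslt)
```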