
How to create a frequency list of every word in a file?

I have a file like this:

This is a file with many words. Some of the words appear more than once. Some of the words only appear one time. 

I would like to generate a two-column list. The first column shows each word that appears, and the second column shows how often it appears, for example:

this@1
is@1
a@1
file@1
with@1
many@1
words@3
some@2
of@2
the@2
only@1
appear@2
more@1
than@1
one@1
once@1
time@1

  • To make this work simpler, I will remove all punctuation and change all text to lowercase before processing the list.
  • Unless there is a simple way around it, "words" and "word" can count as two separate words.

So far, I have this:

sed -i "s/ /\n/g" ./file1.txt # put all words on a new line
while read line
do
     count="$(grep -c $line file1.txt)"
     echo $line"@"$count >> file2.txt # add word and frequency to file
done < ./file1.txt
sort -u -d # remove duplicate lines

For some reason, this is only showing "0" after each word.

How can I generate a list of every word that appears in a file, along with frequency information?

Village asked May 11 '12 13:05



1 Answer

Not sed and grep, but tr, sort, uniq, and awk:

% (tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}') <<EOF
This is a file with many words. Some of the words appear more than once. Some of the words only appear one time.
EOF

a@1
appear@2
file@1
is@1
many@1
more@1
of@2
once.@1
one@1
only@1
Some@2
than@1
the@2
This@1
time.@1
with@1
words@2
words.@1
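The same pipeline can read from a file instead of a here-document. Below is a minimal sketch, assuming the text lives in a file whose path is passed as the first argument. The function name `word_freq` is made up for illustration. It swaps in `tr -s '[:space:]' '\n'` so that runs of spaces and existing line breaks are also treated as word separators.

```shell
# word_freq is a hypothetical helper name, not part of the original answer.
# Usage: word_freq FILE
word_freq() {
    # Turn every run of whitespace into a single newline (one word per line),
    # sort so identical words are adjacent, and count them with uniq -c.
    # awk then reorders "count word" into "word@count"; the NF == 2 guard
    # skips the empty line produced when the file starts with whitespace.
    tr -s '[:space:]' '\n' < "$1" |
        sort |
        uniq -c |
        awk 'NF == 2 {print $2 "@" $1}'
}
```

Called as `word_freq file1.txt`, this leaves the input file untouched, unlike `sed -i`. Punctuation and capitalization are still kept as-is, so "words" and "words." are counted separately.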

In most cases you also want to remove numbers and punctuation, convert everything to lowercase (otherwise "THE", "The" and "the" are counted separately), and suppress the entry for a zero-length word. For ASCII text you can do all of this with the following modified command, which prints each count followed by its word, most frequent first:

sed -e 's/[^A-Za-z]/ /g' text.txt | tr 'A-Z' 'a-z' | tr ' ' '\n' | grep -v '^$' | sort | uniq -c | sort -rn
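To get the `word@count` layout the question asks for together with this cleanup, the two commands can be combined. The following is only a sketch for ASCII text. The name `word_freq_clean` is hypothetical, and it uses `tr -cs` in place of the `sed` and `grep -v` steps, which is a substitution rather than the command above.

```shell
# word_freq_clean is a hypothetical helper name used only for this sketch.
# Usage: word_freq_clean FILE
word_freq_clean() {
    # 1. Lowercase ASCII letters.
    # 2. Replace every run of non-letters with one newline (-c complements
    #    the set, -s squeezes repeats), leaving one word per line.
    # 3. Count identical words, then order by count (highest first),
    #    breaking ties alphabetically.
    # 4. Print as word@count, skipping any empty line.
    tr 'A-Z' 'a-z' < "$1" |
        tr -cs 'a-z' '\n' |
        sort |
        uniq -c |
        sort -k1,1nr -k2 |
        awk 'NF == 2 {print $2 "@" $1}'
}
```

Run on the sample sentence from the question, this should list `words@3` first, followed by the four words that occur twice and then the twelve that occur once.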
eduffy answered Sep 20 '22 15:09