How to create a frequency list of every word in a file?

Tags: grep, bash, file-io, sed

I have a file like this:

    This is a file with many words.
    Some of the words appear more than once.
    Some of the words only appear one time.

I would like to generate a two-column list. The first column shows what words appear, the second column shows how often they appear, for example:

    this@1
    is@1
    a@1
    file@1
    with@1
    many@1
    words@3
    some@2
    of@2
    the@2
    only@1
    appear@2
    more@1
    than@1
    one@1
    once@1
    time@1

To simplify this, I will remove all punctuation and convert all text to lowercase before processing the list.
Unless there is a simple way around it, words and word can count as two separate words.

So far, I have this:

    sed -i "s/ /\n/g" ./file1.txt # put all words on a new line
    while read line
    do
        count="$(grep -c $line file1.txt)"
        echo $line"@"$count >> file2.txt # add word and frequency to file
    done < ./file1.txt
    sort -u -d # remove duplicate lines

For some reason, this is only showing "0" after each word.

How can I generate a list of every word that appears in a file, along with frequency information?

605

asked May 11 '12 13:05

Village

1 Answer

Not sed and grep, but tr, sort, uniq, and awk:

    % (tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}') <<EOF
    This is a file with many words.
    Some of the words appear more than once.
    Some of the words only appear one time.
    EOF

    a@1
    appear@2
    file@1
    is@1
    many@1
    more@1
    of@2
    once.@1
    one@1
    only@1
    Some@2
    than@1
    the@2
    This@1
    time.@1
    with@1
    words@2
    words.@1
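As a variation on the pipeline above, the counting step can also be sketched in a single awk pass with an associative array. This is an alternative technique, not what the answer uses. The function name count_words_awk is hypothetical, and unlike tr ' ' '\n' it splits on any whitespace (tabs included), not only on spaces. Like the pipeline above, it keeps case and punctuation as they are.

```shell
# count_words_awk is a hypothetical helper name, used only for illustration.
# Usage: count_words_awk FILE
count_words_awk() {
    # awk splits each line into whitespace-separated fields and tallies
    # every field in the "count" array; END prints one word@count per entry.
    awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
         END { for (w in count) print w "@" count[w] }' "$1" |
        sort    # awk's "for (w in count)" order is unspecified, so sort the result
}
```

Run against the sample text, this should report the same counts as the pipeline above, including separate entries for words and words. because punctuation is left in place.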

In most cases you also want to remove numbers and punctuation, convert everything to lowercase (otherwise "THE", "The" and "the" are counted separately), and suppress the entry for the zero-length word. For ASCII text, this modified command does all of that and also sorts the result by frequency, highest first:

    sed -e 's/[^A-Za-z]/ /g' text.txt | tr 'A-Z' 'a-z' | tr ' ' '\n' | grep -v '^$' | sort | uniq -c | sort -rn
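The modified command prints the count first, in the layout that uniq -c emits. To get the word@count layout the question asks for, the awk step from the first pipeline can be appended to it. The following is a minimal sketch under the same ASCII-only assumption. The function name word_freq is hypothetical, and the order of words that share a count depends on the locale's sort order.

```shell
# word_freq is a hypothetical helper name, used only for illustration.
# Usage: word_freq FILE
word_freq() {
    sed -e 's/[^A-Za-z]/ /g' "$1" |   # replace every non-letter with a space
        tr 'A-Z' 'a-z' |              # fold everything to lowercase
        tr ' ' '\n' |                 # put each word on its own line
        grep -v '^$' |                # drop the empty lines left by repeated spaces
        sort | uniq -c |              # group identical words and count them
        sort -rn |                    # highest count first
        awk '{print $2 "@" $1}'       # turn "count word" into word@count
}
```

Calling word_freq file1.txt on the sample text from the question should list words@3 first, followed by the words that appear twice and then those that appear once.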

160

answered Sep 20 '22 15:09

eduffy
