I have a text file (list.txt) containing single and multi-word English phrases. My goal is to do a word count for each word and write the results to a CSV file.
I have figured out a command that counts the occurrences of each unique word, sorted from largest to smallest. That command is:
$ tr 'A-Z' 'a-z' < list.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r | less > output.txt
The problem is the way the new file (output.txt) is formatted. Each line has 3 leading spaces, followed by the number of occurrences, followed by a space, followed by the word. Example:
9784 the
6368 and
4211 for
2929 to
What would I need to do to get the results in a more useful format, such as CSV? For example, I'd like it to be:
9784,the
6368,and
4211,for
2929,to
Even better would be:
the,9784
and,6368
for,4211
to,2929
Is there a way to do this with a Unix command, or do I need to do some post-processing within a text editor or Excel?
To create a CSV file with a text editor, first choose your favorite text editor, such as Notepad or vim, and open a new file. Then enter the text data you want the file to contain, separating each value with a comma and each row with a new line. Save this file with the extension .csv.
The uniq command can count and print the number of repeated lines. It can also print only the unique (non-duplicate) lines, ignore case when comparing, and skip a given number of fields or characters before comparing lines.
The uniq command in UNIX is a command line utility for reporting or filtering repeated lines in a file. It can remove duplicates, show a count of occurrences, show only repeated lines, ignore certain characters and compare on specific fields.
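As a small illustration of the counting behaviour described above, here is a minimal sketch. The function name count_lines and the sample words are made up for the example. Note that uniq only collapses adjacent duplicate lines, which is why the input is sorted first. The width of the padding in front of each count varies between implementations, so the exact spacing is not shown here.

```shell
# Hypothetical helper: count how often each line occurs on stdin.
# uniq -c only merges *adjacent* duplicates, so sort the input first.
count_lines() {
    sort | uniq -c
}

# Sample input (made up for illustration): "the" appears three times.
printf 'the\nand\nthe\nfor\nthe\n' | count_lines
```

Each output line is a right-aligned count, a space, and the line itself, which is the format the question is trying to convert.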
Use awk as follows:
> cat input
9784 the
6368 and
4211 for
2929 to
> cat input | awk '{ print $2 "," $1}'
the,9784
and,6368
for,4211
to,2929
Your full pipeline will be:
$ tr 'A-Z' 'a-z' < list.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r | awk '{ print $2 "," $1}' > output.txt
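If it helps to try the idea on a small input first, the same pipeline can be wrapped in a function. This is only a sketch: the name word_counts_csv and the sample text are hypothetical, and it makes two changes that are not in the pipeline above. It adds a secondary sort key so that words with equal counts come out alphabetically, and it uses an NF == 2 guard in awk to skip the empty "word" that appears when the input starts with a non-letter.

```shell
# Hypothetical helper: read text on stdin, write "word,count" lines on stdout.
word_counts_csv() {
    tr 'A-Z' 'a-z' |                 # lowercase everything
        tr -sc 'a-z' '\n' |          # turn every run of non-letters into one newline
        sort |                       # group identical words together
        uniq -c |                    # prefix each word with its count
        sort -k1,1nr -k2,2 |         # count descending, then word ascending
        awk 'NF == 2 { print $2 "," $1 }'   # skip blank entries, swap to word,count
}

# Sample input (made up for illustration).
printf 'The cat and the dog.\nThe end\n' | word_counts_csv
```

For the sample text this prints the,3 followed by and,1, cat,1, dog,1 and end,1, one per line. To process the real file, run word_counts_csv < list.txt > output.txt.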