Unix uniq command to CSV file

Tags: bash, unix, csv, uniq

I have a text file (list.txt) containing single and multi-word English phrases. My goal is to do a word count for each word and write the results to a CSV file.

I have figured out a command that counts the occurrences of each word and sorts the results from largest to smallest. That command is:

$ tr 'A-Z' 'a-z' < list.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r | less > output.txt

The problem is the way the new file (output.txt) is formatted. Each line has 3 leading spaces, followed by the number of occurrences, a space, and the word. Example:

   9784 the
   6368 and
   4211 for
   2929 to

What would I need to do to get the results in a more convenient format, such as CSV? For example, I'd like it to be:

9784,the
6368,and
4211,for
2929,to

Even better would be:

the,9784
and,6368
for,4211
to,2929

Is there a way to do this with a Unix command, or do I need to do some post-processing within a text editor or Excel?

Abundnce10 asked Mar 11 '13 18:03


1 Answer

Use awk as follows:

 > cat input 
   9784 the
   6368 and
   4211 for
   2929 to
 > awk '{ print $2 "," $1 }' input
the,9784
and,6368
for,4211
to,2929
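
This works regardless of how many leading spaces uniq -c emits, because awk's default field splitting skips leading blanks and treats any run of whitespace as one separator. A minimal sketch of that behaviour, using made-up sample lines (the "zebra" row and the file name padded_demo.csv are illustrative only) and setting OFS so the comma does not have to be concatenated by hand:

```shell
# Two sample lines with different amounts of padding, as uniq -c might print them.
# The data and the output file name are illustrative assumptions.
printf '   9784 the\n     12 zebra\n' |
  awk 'BEGIN { OFS = "," } { print $2, $1 }' > padded_demo.csv

# Show the result: word first, count second, comma-separated.
cat padded_demo.csv
```

Both lines come out in word,count form even though their padding differs.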

Your full pipeline will be:

$ tr 'A-Z' 'a-z' < list.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r | awk '{ print $2 "," $1}' > output.txt
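
To try the pipeline end to end without touching the real list.txt, here is a small sketch that runs it on a throwaway input. The file names sample_list.txt and sample_output.csv and the three-line sample text are assumptions made for illustration; the commands themselves are the ones from the pipeline above.

```shell
# Hypothetical sample input standing in for list.txt.
printf 'The end\nthe and\nAnd the\n' > sample_list.txt

# Lowercase, split into one word per line, count, sort by count (descending),
# then let awk swap the columns and join them with a comma.
tr 'A-Z' 'a-z' < sample_list.txt |
  tr -sc 'A-Za-z' '\n' |
  sort | uniq -c | sort -n -r |
  awk '{ print $2 "," $1 }' > sample_output.csv

cat sample_output.csv
```

With this sample the file contains the,3 then and,2 then end,1. Words with equal counts may come out in an order that depends on sort's tie-breaking, so the sample deliberately uses distinct counts.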

Andrew Stein answered Oct 02 '22 17:10