I have a text file with a large amount of tab-delimited data. I want to look at the data so that I can see the unique values in a column. For example,
Red     Ball    1       Sold
Blue    Bat     5       OnSale
...............
So, it's like the first column has colors; I want to know how many different unique values there are in that column, and I want to be able to do that for each column.
I need to do this in a Linux command line, so probably using some bash script, sed, awk or something.
What if I wanted a count of these unique values as well?
Update: I guess I didn't put the second part clearly enough. What I want is a count of "each" of these unique values, not just how many unique values there are. For instance, in the first column I want to know how many Red, Blue, Green etc. coloured objects there are.
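For example, for the first column the desired output would look something like this (hypothetical counts, not taken from the sample above):

Red 5
Blue 3
Green 2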
You can make use of the cut, sort and uniq commands as follows:
cat input_file | cut -f 1 | sort | uniq
This gets the unique values in field 1; replacing 1 with 2 will give you the unique values in field 2.
Avoiding UUOC (a useless use of cat) :)
cut -f 1 input_file | sort | uniq
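As a side note, sort -u folds the sort | uniq pair into a single command:

cut -f 1 input_file | sort -u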
EDIT: To count the number of unique occurrences, you can add the wc command to the chain:
cut -f 1 input_file | sort | uniq | wc -l
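Since you want this for each column, here is a minimal bash sketch that loops over every column. It assumes the file is strictly tab-delimited and that its first line contains the full set of fields; input_file is the same name used above:

ncols=$(head -n 1 input_file | awk -F '\t' '{ print NF }')   # number of columns, from the first line
for i in $(seq 1 "$ncols"); do
    echo "column $i: $(cut -f "$i" input_file | sort | uniq | wc -l) unique values"
done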
For the updated requirement (a count of each unique value, not just the number of unique values), awk can do it in one pass over the first column:

awk -F '\t' '{ a[$1]++ } END { for (n in a) print n, a[n] }' input_file
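The output order of awk's for (n in a) loop is unspecified, so pipe the result through sort if you want it ordered, e.g. by descending count:

awk -F '\t' '{ a[$1]++ } END { for (n in a) print n, a[n] }' input_file | sort -k2 -nr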