I have a text file with a large amount of tab-delimited data. I want to look at the data so that I can see the unique values in a column. For example,
Red     Ball    1       Sold
Blue    Bat     5       OnSale
...............
So, it's like the first column has colors; I want to know how many different unique values there are in that column, and I want to be able to do that for each column.
I need to do this in a Linux command line, so probably using some bash script, sed, awk or something.
What if I wanted a count of these unique values as well?
Update: I guess I didn't put the second part clearly enough. What I want is a count of "each" of these unique values, not just how many unique values there are. For instance, in the first column I want to know how many Red, Blue, Green etc. coloured objects there are.
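For example, for the first column the desired output would look something like this (hypothetical counts, not taken from the sample above):

Red 5
Blue 3
Green 2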
You can make use of the cut, sort and uniq commands as follows:
cat input_file | cut -f 1 | sort | uniq
This gets the unique values in field 1; replacing 1 with 2 will give you the unique values in field 2.
Avoiding UUOC (a useless use of cat) :)
cut -f 1 input_file | sort | uniq
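As a side note, sort -u folds the sort | uniq pair into a single command:

cut -f 1 input_file | sort -u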
EDIT: To count the number of unique occurrences, you can add the wc command to the chain:
cut -f 1 input_file | sort | uniq | wc -l
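Since you want this for each column, here is a minimal bash sketch that loops over every column. It assumes the file is strictly tab-delimited and that its first line contains the full set of fields; input_file is the same name used above:

ncols=$(head -n 1 input_file | awk -F '\t' '{ print NF }')   # number of columns, from the first line
for i in $(seq 1 "$ncols"); do
    echo "column $i: $(cut -f "$i" input_file | sort | uniq | wc -l) unique values"
done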
For the updated requirement (a count of each unique value, not just the number of unique values), awk can do it in one pass over the first column:

awk -F '\t' '{ a[$1]++ } END { for (n in a) print n, a[n] }' input_file
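The output order of awk's for (n in a) loop is unspecified, so pipe the result through sort if you want it ordered, e.g. by descending count:

awk -F '\t' '{ a[$1]++ } END { for (n in a) print n, a[n] }' input_file | sort -k2 -nr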