
Best way to simulate "group by" from bash?

Tags: bash, scripting

sort ip_addresses | uniq -c

This will print the count first, but other than that it should be exactly what you want.
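
If you would rather have the value first and the count second, one option (a sketch; it assumes the values contain no whitespace, which is true of IP addresses) is to swap the columns with awk:

sort ip_addresses | uniq -c | awk '{ print $2, $1 }'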


The quick and dirty method is as follows:

cat ip_addresses | sort -n | uniq -c

If you need to use the values in bash, you can assign the output of the whole command to a bash variable and then loop through the results.
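
For example, a minimal sketch (ip_addresses stands in for your input file):

counts=$(sort ip_addresses | uniq -c)

# each output line has the form "<count> <address>"
while read -r count addr; do
    echo "$addr seen $count times"
done <<< "$counts"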

PS: If the sort command is omitted, you will not get the correct results, as uniq only counts successive identical lines.
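
A quick demonstration of the difference:

printf 'a\nb\na\n' | uniq -c
      1 a
      1 b
      1 a

printf 'a\nb\na\n' | sort | uniq -c
      2 a
      1 b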


For summing up multiple fields based on a group of existing fields, use the example below (replace $1, $2, $3, $4 according to your requirements):

cat file

US|A|1000|2000
US|B|1000|2000
US|C|1000|2000
UK|1|1000|2000
UK|1|1000|2000
UK|1|1000|2000

awk 'BEGIN { FS=OFS=SUBSEP="|"}{arr[$1,$2]+=$3+$4 }END {for (i in arr) print i,arr[i]}' file

US|A|3000
US|B|3000
US|C|3000
UK|1|9000
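
The same program expanded with comments (behavior unchanged; note that for (i in arr) does not guarantee any particular output order):

awk '
BEGIN { FS = OFS = SUBSEP = "|" }       # input, output, and array-key separators are all "|"
{ arr[$1, $2] += $3 + $4 }              # group on fields 1 and 2; sum fields 3 and 4
END { for (i in arr) print i, arr[i] }  # each key already contains "|" via SUBSEP
' file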

The canonical solution is the one mentioned by another respondent:

sort | uniq -c

It is more concise than the equivalent Perl or awk.

You write that you don't want to use sort because the data is larger than the machine's main memory. Don't underestimate the implementation quality of the Unix sort command. It was used to handle very large volumes of data (think the original AT&T billing data) on machines with 128 KB (that's 131,072 bytes) of memory (a PDP-11). When sort encounters more data than a preset limit (often tuned close to the size of the machine's main memory), it sorts what it has read in main memory and writes it to a temporary file. It then repeats this with the next chunk of data, and finally performs a merge sort on the intermediate files. This allows sort to process data many times larger than the machine's main memory.
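
If you want to see or tune this behavior, GNU sort (an assumption; the flags vary by platform, so check your man page) exposes the buffer size and the temporary directory directly:

# -S 512M  caps the in-memory buffer; larger chunks mean fewer intermediate runs
# -T DIR   chooses where the intermediate files are written before the final merge
sort -S 512M -T /var/tmp ip_addresses | uniq -c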