
Creating histograms in bash

EDIT

I read the question that this is supposed to be a duplicate of (this one), and I don't agree. In that question, the aim is to get the frequencies of individual numbers in the column. If I apply that solution to my problem, I'm still left with my initial problem of grouping the frequencies of the numbers in a particular range into the final histogram. For example, if that solution tells me that the frequency of 0.45 is 2 and that of 0.44 is 1 (for my input data), I still have to group those two frequencies into a total of 3 for the range 0.4-0.5.

END EDIT

QUESTION-

I have a long column of data with values between 0 and 1. This will be of the type-

0.34
0.45
0.44
0.12
0.45
0.98
.
.
.

A long column of decimal values with repetitions allowed.

I'm trying to change it into a histogram sort of output such as (for the input shown above)-

0.0-0.1  0
0.1-0.2  1
0.2-0.3  0
0.3-0.4  1 
0.4-0.5  3
0.5-0.6  0
0.6-0.7  0
0.7-0.8  0
0.8-0.9  0
0.9-1.0  1

Basically the first column has the lower and upper bounds of each range and the second column has the number of entries in that range.

I wrote it (badly) as-

for i in $(seq 0 0.1 0.9)
do 
    awk -v var=$i '{if ($1 > var && $1 < var+0.1 ) print $1}' input | wc -l; 
done

Which basically does a wc -l of the entries it finds in each range.

Output formatting is not part of the problem. If I simply get the frequencies corresponding to the different bins, that will be good enough. Also, please note that the bin size should be a variable, as in my proposed solution.

I already read this answer and want to avoid the loop. I'm sure there's a much, much faster way in awk that bypasses the for loop. Can you help me out here?

Chem-man17 asked Sep 21 '16 10:09


1 Answer

Following the same algorithm as in my previous answer, I wrote a script in awk which is extremely fast (look at the picture).

[image]

The script is the following:

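#!/usr/bin/awk -f

BEGIN {
    bin_width = 0.1    # width of each bin
}
{
    # Subtracting a small offset makes each bin include its upper bound,
    # so a value of exactly 1.0 falls into the last bin (0.9-1.0).
    bin = int(($1 - 0.0001) / bin_width)
    hist[bin]++        # unset elements start at 0, so no existence check is needed
}
END {
    # Note: "for (h in hist)" visits the bins in an unspecified order,
    # and bins with no entries are not printed.
    for (h in hist)
        printf " * > %2.2f  ->  %d \n", h * bin_width, hist[h]
}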
   

bin_width is the width of each bin. To use the script, copy it into a file, make it executable (with chmod +x <namefile>), and run it with ./<namefile> <name_of_data_file>.
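The question also asks for the bin size to be a variable and shows empty bins in the desired output. A possible variation on the script (a sketch, not the answer's own code) passes the width with -v and prints every bin in order. The function name histogram, the fixed range 0 to 1 (lo and hi), and the 1e-9 tolerance are assumptions made for illustration. Unlike the script above, each bin here includes its lower bound, and a value of exactly 1.0 is placed in the last bin.

```shell
#!/bin/sh
# Sketch: single-pass histogram with a configurable bin width.
# Usage: histogram WIDTH [FILE...]   (name and argument order are illustrative)
histogram() {
    width=$1
    shift
    awk -v w="$width" -v lo=0 -v hi=1 '
        BEGIN { nbins = int((hi - lo) / w + 0.5) }   # number of bins, rounded
        NF == 0 { next }                             # skip blank lines
        {
            # Bin index; the small tolerance (an assumption) keeps values that
            # sit on a bin edge from slipping into the bin below due to rounding.
            b = int(($1 - lo) / w + 1e-9)
            if (b >= nbins) b = nbins - 1            # the top edge goes in the last bin
            if (b < 0) b = 0                         # clamp anything below the range
            count[b]++
        }
        END {
            # Loop over bin numbers so that empty bins are printed, in order.
            for (b = 0; b < nbins; b++)
                printf "%g-%g  %d\n", lo + b * w, lo + (b + 1) * w, count[b]
        }
    ' "$@"
}
```

With the function defined, histogram 0.1 input prints one line per bin, for example 0.4-0.5  3 for the sample data in the question. Values outside the assumed range are clamped into the first or last bin rather than reported.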

Riccardo Petraglia answered Sep 20 '22 03:09