I am trying to get a more dtrace-style distribution output when running awk over large logfiles after a DDoS, so the output is easier to read:
# tail -1000 access_log | awk '{ print $1 }' | sort | uniq -c | sort -nr | awk '{printf("\n%s ",$0) ; for (i = 0; i<$1 ; i++) {printf("*")};}'
43 192.168.0.1 *******************************************
38 192.168.0.2 **************************************
Hopefully it could look something like:
value ------------- Distribution ------------- count
192.168.0.1 @@@@@@@@@ 43
192.168.0.2 @@@@@@@@ 38
Where the @'s are a scaled-down summary of the count rather than one * per hit. Getting it to scale automatically per run would be an added bonus, versus me having to do the maths to figure out how to rank each count.
Your pipeline is actually pretty good; it really just needs to scale for large numbers. I replaced your tail -1000 access_log | awk '{ print $1 }' | with an unsorted file of IP numbers from one of my web servers, and added head -20 to print just the 20 most active IP addresses.
$ sort ip.txt | uniq -c | sort -nr | \
> awk 'NR==1 { scale = $1 / 50 } \
>      { printf("\n%-23s ", $0); \
>        for (i = 0; i < ($1 / scale); i++) printf("*") }' | head -20
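For your case the same awk should drop straight back into your original pipeline (I haven't run this against an access_log myself, but the awk body is unchanged):

$ tail -1000 access_log | awk '{ print $1 }' | sort | uniq -c | sort -nr | \
> awk 'NR==1 { scale = $1 / 50 } \
>      { printf("\n%-23s ", $0); \
>        for (i = 0; i < ($1 / scale); i++) printf("*") }' | head -20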
The important parts are NR==1{scale=$1/50}, which calculates the scaling factor needed to fit the maximum count into 50 characters (sort -nr puts the largest count on the first line, so with a maximum of 824 the scale is 824/50 ≈ 16.5, and a count of 149 gets 149/16.5 ≈ 9.04, rounded up by the loop to 10 asterisks), and printf("\n%-23s ",$0), which uses the width specifier %-23s to left-align the count and IP address within a 23-character field. My output looks like this; I masked the IP addresses.
824 xx.xxx.xx.39 **************************************************
149 xx.xxx.xxx.176 **********
138 xx.xxx.xxx.191 *********
137 xx.xxx.xxx.41 *********
105 xx.xxx.xxx.8 *******
97 xx.xxx.xxx.21 ******
96 xx.xxx.xx.220 ******
91 xx.xx.xxx.198 ******
87 xx.xxx.xxx.195 ******
85 xx.xxx.xx.221 ******
79 xxx.xxx.xxx.86 *****
69 xx.xx.xx.12 *****
68 xxx.xxx.xxx.159 *****
65 xx.xxx.xxx.66 ****
63 xx.xxx.xx.28 ****
60 xx.xxx.xxx.104 ****
59 xxx.xxx.xxx.242 ****
59 xxx.xx.xxx.66 ****
56 xx.xxx.xxx.202 ****
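If you want something closer to the dtrace layout you sketched, with the value first, @ marks in the middle, and the count at the end, a variant like the following should work. This is a sketch, not tested beyond my ip.txt file; the header text, the 40-character bar width, and the head -21 (one extra line for the header) are choices I made, not anything dtrace-specific:

$ sort ip.txt | uniq -c | sort -nr | \
> awk 'NR==1 { scale = $1 / 40; \
>              printf("%-23s %-42s %s\n", "value", "-------------- Distribution --------------", "count") } \
>      { bar = ""; \
>        for (i = 0; i < ($1 / scale); i++) bar = bar "@"; \
>        printf("%-23s %-42s %d\n", $2, bar, $1) }' | head -21

Note that $2 is the IP address in uniq -c output, and the NR==1 block both prints the header and falls through to the second block, so the busiest address still appears as the first data row.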
This kind of output has a human-factors problem. People judge graphs like these by the area of the lines (the asterisks). Since this display scales with the magnitude of the numbers, you can't visually compare two of these graphs with any reliability.
Your eyes and brain want you to judge the length of the lines. (I'm not sure where I learned this. Maybe from Tufte's books, or from studying statistics.) But the scaling might mean that the longest line on one graph represents 800, while an identical line on another graph might represent only 100. Your eyes and brain want to believe those two are roughly equal, even though one is eight times as big as the other, and even though you can see the raw numbers.
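One way around that is to fix the scale yourself instead of deriving it from the data, so an asterisk means the same number of hits on every run. A rough sketch, where one mark per 20 hits is an arbitrary divisor you would tune to your own traffic levels:

$ sort ip.txt | uniq -c | sort -nr | \
> awk -v scale=20 '{ printf("%-23s ", $0); \
>                    for (i = 0; i < ($1 / scale); i++) printf("*"); \
>                    print "" }' | head -20

The trade-off is that a quiet run prints short bars and a heavy run may overflow your terminal width, but two graphs made this way can be compared honestly.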