
Building a distribution of IP's with counts

Tags:

awk

I am trying to get DTrace-style distribution output when running awk over large log files after a DDoS, so that the output is easier to read:

# tail -1000 access_log | awk '{ print $1 }' | sort | uniq -c | sort -nr | awk '{printf("\n%s ",$0) ; for (i = 0; i<$1 ; i++) {printf("*")};}'

  43 192.168.0.1 *******************************************
  38 192.168.0.2 **************************************

Hopefully it could look something like:

       value  ------------- Distribution ------------- count    
 192.168.0.1  @@@@@@@@@                                43 
 192.168.0.2  @@@@@@@@                                 38 

Here the @'s are a scaled-down summary of the count, rather than one * per hit. Having it scale automatically on each run would be an added bonus, so that I don't have to do the maths to work out how to rank each count.

Jacques Marneweck asked Apr 26 '11 22:04




1 Answer

Your pipeline is actually pretty good. You really just need it to scale large numbers. I replaced your tail -1000 access_log | awk '{ print $1 }' | with an unsorted file of IP addresses from one of my web servers, and added head -20 to print just the 20 most active IP addresses.

$  sort ip.txt | uniq -c | sort -nr | \
>  awk 'NR==1{scale=$1/50} \
>       {printf("\n%-23s ",$0) ; \
>        for (i = 0; i<($1/scale) ; i++) {
>            printf("*")}; \
>        }' | head -20

The important parts are

  • NR==1{scale=$1/50} to calculate the scaling factor to fit the maximum count into 50 characters, and
  • printf("\n%-23s ",$0) ; which uses the width specifier %-23s to left-align the count and IP address within a 23-character space.
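
As a rough sketch of how the same scaling idea could produce the layout asked for in the question (value, a bar of @ characters, then the count), something like the following might work. The function name ip_hist and the 40-column bar width are assumptions made for illustration and are not part of the pipeline above. It also rounds each bar to the nearest character, where the loop above rounds up.

```shell
# ip_hist: hypothetical helper. Reads "count value" lines on stdin, sorted in
# descending order (as produced by `uniq -c | sort -nr`), and prints a
# DTrace-style distribution scaled to the largest count.
ip_hist() {
    awk -v width=40 '
        NR == 1 {
            # The first line carries the largest count; it gets the full width.
            max = $1
            fmt = "%15s  %-" width "s  %s\n"
            printf fmt, "value", "------------- Distribution -------------", "count"
        }
        {
            # Bar length proportional to count / max, rounded to nearest.
            n = int($1 * width / max + 0.5)
            bar = ""
            for (i = 0; i < n; i++) bar = bar "@"
            printf fmt, $2, bar, $1
        }
    '
}

# Example run with made-up counts:
printf '43 192.168.0.1\n38 192.168.0.2\n' | ip_hist
```

In a real run the input would come from the pipeline, for example sort ip.txt | uniq -c | sort -nr | head -20 | ip_hist.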

My output looks like this. I masked the IP addresses.

   824 xx.xxx.xx.39    **************************************************
   149 xx.xxx.xxx.176  **********
   138 xx.xxx.xxx.191  *********
   137 xx.xxx.xxx.41   *********
   105 xx.xxx.xxx.8    *******
    97 xx.xxx.xxx.21   ******
    96 xx.xxx.xx.220   ******
    91 xx.xx.xxx.198   ******
    87 xx.xxx.xxx.195  ******
    85 xx.xxx.xx.221   ******
    79 xxx.xxx.xxx.86  *****
    69 xx.xx.xx.12     *****
    68 xxx.xxx.xxx.159 *****
    65 xx.xxx.xxx.66   ****
    63 xx.xxx.xx.28    ****
    60 xx.xxx.xxx.104  ****
    59 xxx.xxx.xxx.242 ****
    59 xxx.xx.xxx.66   ****
    56 xx.xxx.xxx.202  ****

This kind of output has a human-factors problem. People judge graphs like these by the area of the lines (the asterisks). Because each run rescales the lines to its own largest count, you can't reliably compare two of these graphs by eye.

Your eyes and brain want you to judge the length of the lines. (I'm not sure where I learned this. Maybe from Tufte's books, or from studying statistics.) But the scaling might mean that the longest line on one graph represents 800, while an identical line on another graph represents only 100. Your eyes and brain want to believe those two are roughly equal, even though one is eight times as big as the other, and even though you can see the raw numbers.
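
One possible way around this, offered as a sketch rather than as part of the pipeline above, is to fix the scale yourself so that one asterisk means the same number of hits on every run. The helper name ip_bars and its per_star argument are hypothetical, and per_star is assumed to be a positive integer.

```shell
# ip_bars: hypothetical helper. $1 is the number of hits one "*" stands for
# (defaults to 10). Reads "count value" lines on stdin.
ip_bars() {
    awk -v per_star="${1:-10}" '
        {
            # Round up so that any nonzero count shows at least one star.
            n = int(($1 + per_star - 1) / per_star)
            bar = ""
            for (i = 0; i < n; i++) bar = bar "*"
            printf "%7d %-15s %s\n", $1, $2, bar
        }
    '
}

# Example run with made-up input, one star per 20 hits:
printf '824 xx.xxx.xx.39\n149 xx.xxx.xxx.176\n' | ip_bars 20
```

The trade-off is that with a fixed scale a very large count can run past the width of the terminal, so per_star has to be chosen with the expected range of counts in mind.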

Mike Sherrill 'Cat Recall' answered Oct 22 '22 14:10