Computing the Shannon entropy of an HTTP header using Python. How to do it?

The Shannon entropy is:

H = -sum[ P(Ti) * log2 P(Ti) ]

\r\n\r\n marks the end of an HTTP header:

[screenshot: HTTP header terminated by \r\n\r\n]

Incomplete HTTP header:

[screenshot: incomplete HTTP header]

I have a network dump in PCAP format (dump.pcap) and I am trying to compute, in Python, the entropy of the number of HTTP packets whose header contains \r\n\r\n versus those without it, and compare the two. I read the packets using:

import pyshark

pkts = pyshark.FileCapture('dump.pcap')

I think the Ti in the Shannon formula corresponds to the data from my dump file.

dump.pcap: https://uploadfiles.io/y5c7k

I have already computed the entropy of IP addresses:

import collections
import numpy as np

sample_ips = [
    "131.084.001.031",
    "131.084.001.031",
    "131.284.001.031",
    "131.284.001.031",
    "131.284.001.000",
]

# Count how many times each address occurs and turn the counts into probabilities
C = collections.Counter(sample_ips)
counts = np.array(list(C.values()), dtype=float)
prob = counts / counts.sum()

# Shannon entropy: H = -sum(p * log2(p))
shannon_entropy = (-prob * np.log2(prob)).sum()
print(shannon_entropy)

Any idea? Is it possible to compute the entropy of the number of HTTP packets with \r\n\r\n and without \r\n\r\n in the header, or is that a nonsensical idea?

A few lines of the dump:

Wireshark with an HTTP display filter:

 30 2017/246 11:20:00.304515    192.168.1.18    192.168.1.216   HTTP    339 GET / HTTP/1.1 


    GET / HTTP/1.1
    Host: 192.168.1.216
    accept-language: en-US,en;q=0.5
    accept-encoding: gzip, deflate
    accept: */*
    user-agent: Mozilla/5.0 (X11; Linux i686; rv:45.0) Gecko/20100101 Firefox/45.0
    Connection: keep-alive
    content-type: application/x-www-form-urlencoded; charset=UTF-8



2 Answers

Reminder: the formula for entropy is

H(S) = -sum[ P(Xi) * log2 P(Xi) ], where

S is the content whose entropy you want to calculate,

Xi is the i-th character in the document, and

P(Xi) is the probability of seeing the character Xi in the content.
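
To make the formula concrete, here is a minimal Python 3 sketch (my illustration, not part of the original answer) that computes the character entropy of a single string using the string's own character frequencies; the rest of the answer explains why P(Xi) should really be estimated from a large collection of pages instead:

from collections import Counter
import numpy as np

def char_entropy(s):
    # Shannon entropy of the characters of s, in bits per character
    counts = Counter(s)
    total = sum(counts.values())
    probs = np.array([c / total for c in counts.values()])
    return float(-(probs * np.log2(probs)).sum())

print(char_entropy("GET / HTTP/1.1\r\nHost: 192.168.1.216\r\n\r\n"))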

The first problem here is to estimate P(Xi) correctly. To do that you need to download as many diverse pages as you can: at the very least 100, and several thousand would be better. This is important because you need real pages that represent your domain well.

Now, you have to reconstruct the HTTP layer from the packets. That is not an easy task in real life, because some pages will be split across several packets, their order of arrival may not be what you expect, and some packets might be lost and retransmitted. I recommend reading this blog to get a grip on the subject.

Also, I suggest you calculate entropy for the headers and the body of HTTP requests separately, because I expect the distribution of characters in the header and in the body to be different; a simple way to split the two is sketched below.
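
For example, since the header is separated from the body by the first blank line, a minimal split might look like this (my sketch, not code from the original answer):

def split_http_message(raw):
    # Split a raw HTTP message at the first blank line; body is empty
    # if the \r\n\r\n terminator is missing
    header, _, body = raw.partition('\r\n\r\n')
    return header, body

header, body = split_http_message("GET / HTTP/1.1\r\nHost: 192.168.1.216\r\n\r\nfield=value")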

Now, when you have access to the desired content, you just count the frequency of each character. Something like the following (doc_collection might be a list containing the contents of all the HTTP headers you have extracted from your PCAPs):

from collections import Counter

def estimate_probabilities(doc_collection):
    # Accumulate character counts over the whole collection
    freq = Counter()
    for doc in doc_collection:
        freq.update(Counter(doc))
    # Normalize the counts into probabilities
    total = 1.0 * sum(freq.values())
    P = {k: freq[k] / total for k in freq.keys()}
    return P
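
A quick usage sketch (the two headers are made up for illustration):

headers = [
    "GET / HTTP/1.1\r\nHost: 192.168.1.216\r\n\r\n",
    "GET /index.html HTTP/1.1\r\nHost: example.com\r\n",
]
P = estimate_probabilities(headers)
print(P['\r'])   # probability of seeing '\r' across the whole collection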

Now that you have the probabilities of the characters, calculating entropy is simple:

from collections import Counter
import numpy as np

def entropy(s, P):
    # epsilon keeps the logarithm finite when a probability is (close to) zero
    epsilon = 1e-8
    total = 0
    for k, v in Counter(s).items():
        total -= v * P[k] * np.log2(P[k] + epsilon)
    return total

If you like, you can even speed it up using map:

from collections import Counter
import numpy as np

def entropy(s, P):
    epsilon = 1e-8
    return -sum(map(lambda a: a[1] * P[a[0]] * np.log2(P[a[0]] + epsilon), Counter(s).items()))

epsilon is needed to prevent the logarithm from going to minus infinity when the probability of a symbol is close to zero.

Now, if you want to calculate entropy excluding some characters ("\r" and "\n" in your case), just zero their probabilities, e.g. P['\n'] = 0. That will remove those characters from the count.
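
For instance (a small sketch of that suggestion):

# Exclude CR and LF from the entropy calculation by zeroing their probabilities
for ch in ('\r', '\n'):
    if ch in P:
        P[ch] = 0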

-- updated to answer the comment:

If you want to sum the entropy depending on whether the substring is present, your program will look something like this:

....
P = estimate_probabilities(all_HTTP_headers_list)

....
count_with, count_without = 0, 0
H = entropy(s, P)
if '\r\n\r\n' in s:
    count_with += H
else:
    count_without += H

all_HTTP_headers_list is the collection of all the headers you have, and s is one specific header. A self-contained sketch of the whole loop follows.
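
Putting the pieces together, the whole loop could look like the following; the two toy headers are mine, purely for illustration, and in practice you would use the headers extracted from the PCAP as shown in update 2 below:

from collections import Counter
import numpy as np

def estimate_probabilities(doc_collection):
    freq = Counter()
    for doc in doc_collection:
        freq.update(Counter(doc))
    total = 1.0 * sum(freq.values())
    return {k: freq[k] / total for k in freq}

def entropy(s, P):
    epsilon = 1e-8
    return -sum(v * P[k] * np.log2(P[k] + epsilon) for k, v in Counter(s).items())

# Toy stand-ins for the headers extracted from the PCAP
all_HTTP_headers_list = [
    "GET / HTTP/1.1\r\nHost: 192.168.1.216\r\n\r\n",        # complete header
    "GET /index.html HTTP/1.1\r\nHost: 192.168.1.216\r\n",  # truncated header
]

P = estimate_probabilities(all_HTTP_headers_list)

count_with, count_without = 0, 0
for s in all_HTTP_headers_list:
    H = entropy(s, P)
    if '\r\n\r\n' in s:
        count_with += H
    else:
        count_without += H

print(count_with, count_without)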

-- update2: how to read HTTP headers

pyshark is not the best solution for packet manipulation, because it drops the payload, but it is fine for just getting the headers.

import pyshark

pkts = pyshark.FileCapture('dump.pcap')

headers = []
for pk in pkts:
    if pk.highest_layer == 'HTTP':
        # tcp.payload is a ':'-separated string of hex byte values
        raw = pk.tcp.payload.split(':')
        headers.append(''.join([chr(int(ch, 16)) for ch in raw]))

Here you check that the packet actually has an HTTP layer, get its payload from the TCP layer (a ':'-separated string of hex byte values), convert it back to text, and end up with all the HTTP headers from the PCAP as a list.
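
One caveat from my side, not from the original answer: in some captures a packet can report HTTP as its highest layer yet have no tcp.payload field, in which case the attribute access raises an AttributeError. A slightly more defensive variant simply skips such packets:

headers = []
for pk in pkts:
    if pk.highest_layer != 'HTTP':
        continue
    try:
        raw = pk.tcp.payload.split(':')   # ':'-separated hex byte values
    except AttributeError:
        continue                          # no TCP payload available for this packet
    headers.append(''.join(chr(int(ch, 16)) for ch in raw))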



While I don't see why you want to do it, I disagree with others who believe it is nonsensical.

You could, for instance, take a coin, flip it, and measure its entropy. Suppose you flip it 1,000 times and get 500 heads and 500 tails. That is a frequency of 0.5 for each outcome, or what statisticians would formally call an 'event'.

Now, since the two Ti's are equally likely (0.5 each), and the log base 2 of 0.5 is -1, the entropy of the coin is -2 * (0.5 * -1) = 1 bit (the minus sign is the one out front of the sum, and the 2 comes from recognizing that adding two identical terms is the same as multiplying by 2).

What if the coin came up heads 127 times more often than tails? Tails now occurs with probability 1/128, which has a log base 2 of -7, so that outcome contributes -(1/128) * (-7) = 7/128, or about 0.055. Heads occur with probability 127/128, which is really close to 1, and the log base 2 (or base anything) of a number close to 1 is close to zero, so that term contributes only about 0.011. Thus, the entropy of that heavily biased coin is about 0.07 bits, far less than the fair coin's 1 bit.
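
A quick check of that arithmetic (my sketch, not part of the original answer):

import numpy as np

def coin_entropy(p_heads):
    # Shannon entropy of a biased coin, in bits
    p = np.array([p_heads, 1.0 - p_heads])
    p = p[p > 0]                       # skip zero-probability outcomes
    return float(-(p * np.log2(p)).sum())

print(coin_entropy(0.5))               # 1.0 bit for the fair coin
print(coin_entropy(127.0 / 128.0))     # about 0.066 bits for the heavily biased coin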

So the trick for you is to collect lots of messages and count them into two buckets (headers with \r\n\r\n and headers without). Then just do the calculations as above; a short sketch follows.
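
A minimal sketch of that two-bucket calculation; the counts are hypothetical, and in practice they would come from classifying each captured header as containing \r\n\r\n or not:

import numpy as np

def two_bucket_entropy(n_with, n_without):
    # Entropy (in bits) of the split between headers with and without \r\n\r\n
    counts = np.array([n_with, n_without], dtype=float)
    probs = counts / counts.sum()
    probs = probs[probs > 0]           # avoid log2(0)
    return float(-(probs * np.log2(probs)).sum())

# Hypothetical counts: 950 complete headers, 50 incomplete ones
print(two_bucket_entropy(950, 50))     # about 0.29 bits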

If you are asking how to do that counting, and you have the messages on a computer, you can use a tool like grep (the regular-expression search tool on Unix) or a similar utility on other systems. It will do the filtering and counting for you.
