The Shannon entropy is H = -sum_i [ Ti * log2(Ti) ].
\r\n\r\n marks the end of a complete HTTP header; a header without it is incomplete.
I have a network dump in PCAP format (dump.pcap) and I am trying to compute the entropy of the number of HTTP packets with \r\n\r\n and without \r\n\r\n in the header using Python, and to compare them. I read the packets using:
import pyshark
pkts = pyshark.FileCapture('dump.pcap')
I think Ti in the Shannon formula is the data of my dump file.
dump.pcap: https://uploadfiles.io/y5c7k
I already computed the entropy of IP numbers:
import numpy as np
import collections

sample_ips = [
    "131.084.001.031",
    "131.084.001.031",
    "131.284.001.031",
    "131.284.001.031",
    "131.284.001.000",
]

# Count occurrences of each IP and turn the counts into probabilities
C = collections.Counter(sample_ips)
counts = np.array(list(C.values()), dtype=float)
prob = counts / counts.sum()

# Shannon entropy: H = -sum(p * log2(p))
shannon_entropy = (-prob * np.log2(prob)).sum()
print(shannon_entropy)
Any idea? Is it possible to compute the entropy of the number of HTTP packets with \r\n\r\n and without \r\n\r\n in the header, or is it a nonsensical idea?
A few lines of the dump:
30 2017/246 11:20:00.304515 192.168.1.18 192.168.1.216 HTTP 339 GET / HTTP/1.1
GET / HTTP/1.1
Host: 192.168.1.216
accept-language: en-US,en;q=0.5
accept-encoding: gzip, deflate
accept: */*
user-agent: Mozilla/5.0 (X11; Linux i686; rv:45.0) Gecko/20100101 Firefox/45.0
Connection: keep-alive
content-type: application/x-www-form-urlencoded; charset=UTF-8
Reminder: the formula for entropy is
H(S) = -sum[ P(Xi) * log2 P(Xi) ]
, where S is the content whose entropy you want to calculate, Xi is the i-th character in the document, and P(Xi) is the probability of seeing the character Xi in the content.
The first problem here is to estimate P(Xi) correctly. To do that you need to download as many diverse pages as you can: at the very least 100, and several thousand would be better. This is important, because you need real pages that represent your domain well.
Now, you have to reconstruct the HTTP layer from the packets. That is not an easy task in real life, because some pages will be split across several packets, their order of arrival may not be what you expect, and some packets might be lost and retransmitted. I recommend reading this blog to get a grip on the subject.
Also, I suggest you calculate the entropy for the headers and the body of HTTP requests separately, because I expect the distribution of characters in the header and in the body to be different.
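For example, a minimal sketch of that split (assuming you already have a reassembled request as one string; the name split_http_message is mine, not from the question):
def split_http_message(message):
    """Split a reassembled HTTP request into (header, body) at the first blank line."""
    # A complete header ends with '\r\n\r\n'; everything after it is the body.
    head, sep, body = message.partition('\r\n\r\n')
    if not sep:
        return message, None  # no blank line: the header is incomplete/truncated
    return head, body
The headers and bodies can then be collected into two separate doc_collection lists and fed to the probability estimation below.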
Now, when you have access to the desired content, you just count the frequencies of each character. Something like the following (doc_collection might be a list of the contents of all the HTTP headers you have extracted from your PCAPs):
from collections import Counter

def estimate_probabilities(doc_collection):
    # Accumulate character frequencies over every document in the collection
    freq = Counter()
    for doc in doc_collection:
        freq.update(Counter(doc))
    total = 1.0 * sum(freq.values())
    # Normalize the counts into a probability for each character
    P = {k: freq[k] / total for k in freq.keys()}
    return P
Now that you have the probabilities of the characters, calculating entropy is simple:
import numpy as np
from collections import Counter

def entropy(s, P):
    epsilon = 1e-8  # keeps log2 finite when a probability is (close to) zero
    H = 0.0
    for k, v in Counter(s).items():  # .items(); .iteritems() is Python 2 only
        H -= v * P[k] * np.log2(P[k] + epsilon)
    return H
If you like, you can even speed it up using map:
import numpy as np
from collections import Counter

def entropy(s, P):
    epsilon = 1e-8
    return -sum(map(lambda a: a[1] * P[a[0]] * np.log2(P[a[0]] + epsilon), Counter(s).items()))
epsilon is needed to keep the logarithm from going to minus infinity if the probability of a symbol is close to zero.
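A quick illustration of what the guard does (the numbers are only illustrative):
import numpy as np

print(np.log2(0.0))         # -inf (numpy also emits a divide-by-zero RuntimeWarning)
print(np.log2(0.0 + 1e-8))  # about -26.6: large, but finite, so the sum stays usable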
Now, if you want to calculate the entropy excluding some characters ("\r" and "\n" in your case), just zero their probabilities, e.g. P['\n'] = 0. That will remove all those characters from the count.
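For example (a small sketch; header stands for one extracted header string and is not defined above):
for ch in ('\r', '\n'):
    if ch in P:
        P[ch] = 0          # zero probability => the term contributes nothing
H = entropy(header, P)     # '\r' and '\n' no longer add to the sum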
-- updated to answer the comment:
If you want to sum the entropy depending on existence of the substring, your program will look like:
....
P = estimate_probabilities(all_HTTP_headers_list)
....
count_with, count_without = 0, 0
H = entropy(s, P)
if '\r\n\r\n' in s:
    count_with += H
else:
    count_without += H
all_HTTP_headers_list is the list of all the headers you have, and s is the specific header.
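Spelled out as a loop over all the headers (a sketch only, reusing estimate_probabilities and entropy from above, and keeping the packet counts separate from the accumulated entropies):
P = estimate_probabilities(all_HTTP_headers_list)

count_with, count_without = 0, 0
entropy_with, entropy_without = 0.0, 0.0
for s in all_HTTP_headers_list:
    H = entropy(s, P)
    if '\r\n\r\n' in s:        # complete header
        count_with += 1
        entropy_with += H
    else:                      # incomplete header
        count_without += 1
        entropy_without += H

print(count_with, entropy_with)
print(count_without, entropy_without)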
-- update2: how to read HTTP headers
pyshark is not the best solution for packet manipulation, because it drops the payload, but it is OK for just getting the headers.
import pyshark

pkts = pyshark.FileCapture('dump.pcap')
headers = []
for pk in pkts:
    if pk.highest_layer == 'HTTP':
        # The TCP payload comes back as a ':'-separated string of hex byte values
        raw = pk.tcp.payload.split(':')
        headers.append(''.join(chr(int(ch, 16)) for ch in raw))
Here you check that the packet actually has an HTTP layer, get its payload from the TCP layer (as a ':'-separated hex string), do some string manipulation, and in the end receive all the HTTP headers from the PCAP as a list.
While I don't see why you want to do it, I disagree with others who believe it is nonsensical.
You could, for instance, take a coin, flip it, and measure its entropy. Suppose you flip it 1,000 times and get 500 heads and 500 tails. That is a 0.5 frequency for each outcome, or what statisticians would formally call an 'event'.
Now, since the two Ti's are equal (0.5 each), and the log base 2 of 0.5 is -1, the entropy of the coin is -2 * (0.5 * -1) = 1 bit (the 2 comes from recognizing that adding two identical terms is the same as multiplying by 2, and the leading minus sign comes from the formula).
What if the coin came up heads 127 times more often than tails? Tails now occurs with probability 1/128, which has a log base 2 of -7. So, after the minus sign out front flips it, that term contributes about 7/128, roughly 0.055. Heads have a probability really close to 1, and the log base 2 (or base anything) of 1 is zero, so that term gives roughly zero (more precisely, about 0.011). Thus, the entropy of that biased coin is only about 0.07 bits.
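A quick numerical check of that biased-coin arithmetic:
import numpy as np

p = np.array([127/128, 1/128])   # P(heads), P(tails)
H = -(p * np.log2(p)).sum()
print(H)                         # about 0.066 bits: ~0.055 from tails, ~0.011 from heads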
So the trick for you is to collect lots of random messages, and count them into two buckets. Then just do the calculations as above.
If you are asking how to do that counting, and you have the messages on a computer, you can use a tool like grep (the regular expression tool on Unix) or a similar utility on other systems to separate them into the two buckets for you.
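For the PCAP question itself, the two buckets could be counted directly from the capture and the two-outcome entropy computed exactly like the coin above (a sketch only, reusing the ':'-separated-hex payload trick from the other answer):
import numpy as np
import pyshark

pkts = pyshark.FileCapture('dump.pcap')
with_sep, without_sep = 0, 0
for pk in pkts:
    if pk.highest_layer != 'HTTP':
        continue
    try:
        raw = pk.tcp.payload.split(':')             # payload as ':'-separated hex bytes
        text = ''.join(chr(int(ch, 16)) for ch in raw)
    except AttributeError:
        continue                                    # packet exposes no TCP payload field
    if '\r\n\r\n' in text:
        with_sep += 1
    else:
        without_sep += 1

counts = np.array([with_sep, without_sep], dtype=float)
prob = counts / counts.sum()
prob = prob[prob > 0]                               # drop an empty bucket to avoid log2(0)
print(-(prob * np.log2(prob)).sum())                # entropy of the with/without distribution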