How to calculate the entropy of a file?

How to calculate the entropy of a file? (Or let's just say a bunch of bytes)
I have an idea, but I'm not sure that it's mathematically correct.

My idea is the following:

  • Create an array of 256 integers (all zeros).
  • Traverse through the file and for each of its bytes,
    increment the corresponding position in the array.
  • At the end: Calculate the "average" value for the array.
  • Initialize a counter with zero,
    and for each of the array's entries:
    add the absolute difference between that entry and the "average" to the counter
    (a minimal sketch of these steps follows the list).
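
In Python it would look something like this (the file name "data.bin" is just a placeholder):

    counts = [0] * 256                    # one counter per possible byte value

    with open("data.bin", "rb") as f:
        for byte in f.read():             # iterating bytes yields ints 0..255 in Python 3
            counts[byte] += 1

    average = sum(counts) / 256                          # the "average" step
    deviation = sum(abs(c - average) for c in counts)    # the counter step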

Well, now I'm stuck. How do I "project" the counter result so that all results lie between 0.0 and 1.0? But I suspect the idea is inconsistent anyway...

I hope someone has a better and simpler solution.

Note: I need the whole thing to make assumptions about the file's contents
(plaintext, markup, compressed, some binary format, ...)

asked Jun 13 '09 by ivan_ivanovich_ivanoff


2 Answers

  • At the end: Calculate the "average" value for the array.
  • Initialize a counter with zero, and for each of the array's entries: add the absolute difference between that entry and the "average" to the counter.

With some modifications you can get Shannon's entropy:

rename "average" to "entropy"

    (float) entropy = 0
    for i in Counts[0..255] do
        (float) p = Counts[i] / filesize
        if (p > 0) entropy = entropy - p * lg(p)   // lg is the logarithm with base 2

Edit: As Wesley mentioned, we must divide the entropy by 8 to adjust it into the range 0..1 (or, alternatively, we can use the logarithm with base 256).
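
A runnable Python version of the above, for reference (the function name is mine; any language works the same way):

    import math

    def shannon_entropy(path):
        """Shannon entropy of a file's bytes, scaled to 0.0 .. 1.0."""
        counts = [0] * 256
        with open(path, "rb") as f:
            data = f.read()
        for byte in data:
            counts[byte] += 1

        entropy = 0.0
        for count in counts:
            if count > 0:
                p = count / len(data)
                entropy -= p * math.log2(p)   # in bits per byte, range 0 .. 8

        return entropy / 8                    # divide by 8 -> range 0.0 .. 1.0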

answered Oct 13 '22 by Nick Dandoulakis


A simpler solution: gzip the file. Use the ratio of file sizes, (size-of-gzipped)/(size-of-original), as a measure of randomness (i.e. entropy).

This method doesn't give you the exact absolute value of entropy (because gzip is not an "ideal" compressor), but it's good enough if you need to compare the entropy of different sources.
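
A rough sketch of this approach in Python, using zlib (which uses the same DEFLATE algorithm as gzip); the function name is mine:

    import zlib

    def compression_ratio(path):
        """Compressed-to-original size ratio; closer to 1.0 means higher entropy."""
        with open(path, "rb") as f:
            data = f.read()
        if not data:
            return 0.0
        return len(zlib.compress(data)) / len(data)

Note that the ratio can slightly exceed 1.0 for incompressible data, because the compressed stream carries a few bytes of header overhead.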

answered Oct 13 '22 by Igor Krivokon