How to calculate the entropy of a file? (Or let's just say a bunch of bytes)
I have an idea, but I'm not sure that it's mathematically correct.
My idea is the following:
Well, now I'm stuck. How to "project" the counter result in such a way that all results would lie between 0.0 and 1.0? But I'm sure, the idea is inconsistent anyway...
I hope someone has better and simpler solutions?
Note: I need the whole thing to make assumptions on the file's contents:
(plaintext, markup, compressed or some binary, ...)
In other words, it's a measure of the "randomness" of the data in a file - measured in a scale of 1 to 8 (8 bits in a byte), where typical text files will have a low value, and encrypted or compressed files will have a high measure.
For example, in a binary classification problem (two classes), we can calculate the entropy of the data sample as follows: Entropy = -(p(0) * log(P(0)) + p(1) * log(P(1)))
What is Entropy? Entropy is a measure of randomness within a set of data.
To compute Entropy the frequency of occurrence of each character must be found out. The probability of occurrence of each character can therefore be found out by dividing each character frequency value by the length of the string message.
- At the end: Calculate the "average" value for the array.
- Initialize a counter with zero, and for each of the array's entries: add the entry's difference to "average" to the counter.
With some modifications you can get Shannon's entropy:
rename "average" to "entropy"
(float) entropy = 0 for i in the array[256]:Counts do (float)p = Counts[i] / filesize if (p > 0) entropy = entropy - p*lg(p) // lgN is the logarithm with base 2
Edit: As Wesley mentioned, we must divide entropy by 8 in order to adjust it in the range 0 . . 1 (or alternatively, we can use the logarithmic base 256).
A simpler solution: gzip the file. Use the ratio of file sizes: (size-of-gzipped)/(size-of-original) as measure of randomness (i.e. entropy).
This method doesn't give you the exact absolute value of entropy (because gzip is not an "ideal" compressor), but it's good enough if you need to compare entropy of different sources.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With