I've been playing around with Cryptocat, an interesting online chat service that lets you encrypt your messages with a key, so that only people with the same key can read them. An interesting aspect of the service (in my opinion) is that text encrypted with a key other than the one you're using is displayed simply as "[encrypted]", rather than as a bunch of garbage cipher text. My question is: in Python, is there a good way to determine whether or not a given piece of text is cipher text? I'm using RC4 for this example, because it was the fastest thing I could implement (based on the pseudo-code on Wikipedia). Thanks.
there is no guaranteed way to tell, but in practice you can do two things:
check for many non-ascii characters (if you're expecting people to be sending english text).
check the distribution of values. in normal text, some letters are much more common than others. but in encrypted text, all characters are about equally likely.
a simple way of doing the latter is to see if any character occurs more than (N/256) + 5 * sqrt(N/256) times (where N is the total number of characters), in which case the text is likely natural language (unencrypted).
in python (reversing the logic above, to give "true" when encrypted):
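    from collections import defaultdict
    from math import sqrt

    def encrypted(text):
        # count how often each character (byte value) occurs
        scores = defaultdict(int)
        for letter in text:
            scores[letter] += 1
        # default=0 keeps max() from raising on empty input (python 3.4+)
        largest = max(scores.values(), default=0)
        average = len(text) / 256.0
        return largest < average + 5 * sqrt(average)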
the maths comes from treating the count of each character as roughly gaussian around the average (N/256), with a variance equal to the average, so 5 * sqrt(average) is about five standard deviations. it's not perfect, but it's probably close enough. with small amounts of text, when the test is unreliable, this will tend to return false (sorry; earlier i had an incorrect version with "max()" which had the logic for small numbers the wrong way round).
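the first suggestion above (counting non-ascii characters) can be sketched in a similar way. this is only a rough illustration: the function names and the 0.3 cut-off are my own assumptions, not derived values, and the input is assumed to be a python 3 bytes object holding either english text or cipher text.

```python
def non_ascii_fraction(data):
    """Fraction of bytes that are not printable ASCII, tab, CR or LF."""
    if not data:
        return 0.0
    allowed = set(range(32, 127)) | {9, 10, 13}
    # iterating over bytes yields integers in python 3
    odd = sum(1 for byte in data if byte not in allowed)
    return odd / len(data)


def looks_encrypted_by_ascii(data, threshold=0.3):
    # threshold is an assumed cut-off: uniformly random bytes land
    # outside the allowed set about 62% of the time, plain english
    # text almost never does
    return non_ascii_fraction(data) > threshold
```

like the frequency test, this is unreliable for very short messages, and it says nothing useful about text that is legitimately non-english.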
Every cipher worth its name will produce output that appears to be completely random. You can exploit this fact for a quick test of whether you are dealing with encrypted text or with data that follows some unknown protocol. To do so, check the distribution of byte values in the byte stream you are eavesdropping on: if all values are roughly uniformly distributed, there's a good chance you're dealing with encrypted text.
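One possible way to put a number on "uniformly distributed" is a chi-squared statistic over the 256 byte values. The sketch below is an illustration under my own assumptions, not a definitive test: the names `chi_squared_uniform` and `looks_uniform`, the limit of 368 (roughly the mean of 255 plus five standard deviations for 255 degrees of freedom) and the minimum length of 1280 bytes are all hypothetical choices.

```python
from collections import Counter


def chi_squared_uniform(data):
    """Chi-squared statistic of the byte counts against a uniform distribution."""
    if not data:
        raise ValueError("need at least one byte")
    expected = len(data) / 256.0
    counts = Counter(data)
    return sum((counts.get(value, 0) - expected) ** 2 / expected
               for value in range(256))


def looks_uniform(data, limit=368.0, min_length=1280):
    # With 255 degrees of freedom the statistic has mean 255 and
    # standard deviation sqrt(2 * 255), about 22.6, for uniform data.
    # Both limit and min_length are assumed values.
    if len(data) < min_length:
        return False  # too little data to say anything
    return chi_squared_uniform(data) < limit
```

Refusing to answer below a minimum length is a deliberate choice here: with only a handful of bytes per possible value, the statistic is too noisy to mean much.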
To gain more and more confidence in the decision you could widen the tests to something more sophisticated such as analyzing the distribution of pairs or triplets of bytes etc.
On the other hand, you could also compare statistical data on the digrams and trigrams of your particular language of interest with their occurrences in the data you observe (see also here). If your data behaves similarly, then it's more likely that you are observing plain text.
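As a rough illustration of the digram idea for English, the sketch below measures what fraction of adjacent character pairs belong to a short list of digrams that are commonly cited as frequent in English. The list is a small hand-picked sample rather than a real frequency table, and the name `common_digram_fraction` is hypothetical; a serious version would compare against full digram statistics for the language.

```python
# A small, assumed sample of frequent English digrams (not a full table).
COMMON_DIGRAMS = {"th", "he", "in", "er", "an", "re", "on", "at", "en", "nd"}


def common_digram_fraction(text):
    """Fraction of adjacent character pairs found in COMMON_DIGRAMS."""
    text = text.lower()
    pairs = [text[i:i + 2] for i in range(len(text) - 1)]
    if not pairs:
        return 0.0
    hits = sum(1 for pair in pairs if pair in COMMON_DIGRAMS)
    return hits / len(pairs)
```

English prose should score noticeably above zero, while cipher text (decoded byte for byte, for example as Latin-1) should score very close to zero, since a random pair of bytes almost never spells one of these digrams.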