 

Text deciphering, letter frequency based approach (questions about cost function)

I would like to decipher texts based on frequency analysis. Programming is not the problem, but there are some mathematical difficulties.

(No worries, this isn't for hacking; I want to have a go at the Zodiac 340 cipher http://zodiackillerciphers.com/wiki/images/7/7d/340-cipher-hi-resolution.jpg, but the question is about deciphering in general, not about the other problems with that cipher.)

I've broken it down into 5 brief questions, all related to the cost function, to show my effort; short answers are fine, any help appreciated. My problem is that the differences between the values of the cost function are very small.

Given

  1. A text with any number of symbols, called the cipher from now on. The cipher is in English. Each symbol in the cipher stands for only one letter, but one letter might be expressed through several symbols. We don't know if there are any spaces (but the string that has to be evaluated by the cost function will be space-delimited and contain only the letters A-Z).
  2. Letter frequency statistics (A-Z and space) for single letters, letter pairs and letter triplets. The 4000 most common words in English, or "all" words using the SOWPODS scrabble dictionary.

Questions about frequency analysis:

  1. Is it better to check only for the most common words, or for all words using SOWPODS (maybe removing 2- and 3-letter words that are not among the 4000 most common words)?
  2. For the letter pairs and triplets: is it better to store just their overall frequency, or to store them in the form P(A|B) (the probability that an A follows a B) and P(C|AB) for triplets?
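To make question 2 concrete, here is a minimal sketch of how both representations could be built from the same bigram counts (a hypothetical helper; the function name and the uppercase-only normalization are my assumptions, not from the question):

```python
from collections import Counter

def bigram_stats(text):
    """Build both representations from one pass over the text:
    overall frequency P(AB) and conditional P(B|A)."""
    text = text.upper()
    pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(pairs.values())
    # Counts of the *first* letter of each pair, for conditioning.
    firsts = Counter(text[:-1])
    freq = {p: c / total for p, c in pairs.items()}           # P(AB)
    cond = {p: c / firsts[p[0]] for p, c in pairs.items()}    # P(B|A)
    return freq, cond
```

The overall-frequency table needs only one normalization constant, while the conditional table needs a per-letter denominator; both are computable from the same counts, so the storage choice is really about which one the cost function consumes.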

Concept

Skip if not interested. I don't want to go into much detail here; there are several methods that could be used. A rough sketch:

  1. Generate a (semi-)random solution
  2. Local optimization of the solution based on a cost function
  3. Start over and transfer some of the knowledge acquired
  4. After stagnation for some time try the same thing with the introduction of spaces at fixed positions before the local optimization (in case the message has no spaces)
  5. Compare the 2 retrieved solutions and return the better one
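Steps 1 and 2 above could be sketched as hill climbing with random restarts (a sketch only: the `cost` callable, the acceptance rule and the iteration counts are placeholders, and real solvers for homophonic ciphers usually need smarter moves than single-symbol resampling):

```python
import random
import string

def crack(cipher_symbols, cost, iterations=1000, restarts=3):
    """Hill climbing with random restarts. `cost` scores a candidate
    plaintext (lower is better). Each cipher symbol maps to exactly one
    letter; several symbols may map to the same letter (homophonic)."""
    symbols = sorted(set(cipher_symbols))
    best_key, best_cost = None, float("inf")
    for _ in range(restarts):
        # Step 1: a random starting key.
        key = {s: random.choice(string.ascii_uppercase) for s in symbols}
        cur = cost("".join(key[s] for s in cipher_symbols))
        # Step 2: local optimization by single-symbol mutations.
        for _ in range(iterations):
            s = random.choice(symbols)
            old = key[s]
            key[s] = random.choice(string.ascii_uppercase)
            new = cost("".join(key[t] for t in cipher_symbols))
            if new < cur:
                cur = new
            else:
                key[s] = old  # revert a non-improving move
        if cur < best_cost:
            best_cost, best_key = cur, dict(key)
    return best_key, best_cost
```

Steps 3-5 (transferring knowledge between restarts, trying fixed space positions, comparing the results) would wrap around this loop.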

Cost function

What would a cost function look like? A general one could be expressed as:

w1 * letterCost + w2 * pairCost + w3 * tripletCost + w4 * wordCost

and the sum of all weights is one:

w1 + w2 + w3 + w4 = 1
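A minimal sketch of this weighted sum, using the squared-difference idea from the question below (the expected-frequency tables, the `word_cost` callable and the equal default weights are placeholders, not a recommendation):

```python
from collections import Counter

def ngram_cost(text, expected, n):
    """Sum of squared differences between observed and expected
    n-gram frequencies. `expected` maps n-gram -> probability."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values()) or 1
    return sum((grams.get(g, 0) / total - p) ** 2
               for g, p in expected.items())

def total_cost(text, exp1, exp2, exp3, word_cost,
               w=(0.25, 0.25, 0.25, 0.25)):
    """w1*letterCost + w2*pairCost + w3*tripletCost + w4*wordCost."""
    w1, w2, w3, w4 = w
    return (w1 * ngram_cost(text, exp1, 1)
            + w2 * ngram_cost(text, exp2, 2)
            + w3 * ngram_cost(text, exp3, 3)
            + w4 * word_cost(text))
```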

Questions about the cost function

  1. With the simple frequencies, ignoring words (w4 = 0), you could just count the frequencies and take the squared differences (this is what I'm currently doing). What I wonder here is: is it more reasonable to have w1 = w2 = w3, or w1 = 27 * w2 = 27 * 27 * w3?

  2. How would it work with the conditional probabilities?

  3. How do you incorporate the knowledge about the words? Just count how many real English words there are, probably weighting them by their length, or is there a more intelligent way?
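For question 3, the "count words, weight by length" idea could look like this (a hypothetical sketch; the quadratic length weight and the normalization by text length are my assumptions, and the score is negated so that it drops as more real words appear, matching a minimized cost):

```python
def word_cost(text, dictionary):
    """Negative reward for recognized words: longer matches count
    more (weight = len**2). `dictionary` is a set of uppercase words;
    `text` is the space-delimited candidate plaintext."""
    score = sum(len(w) ** 2 for w in text.split() if w in dictionary)
    return -score / max(len(text), 1)
```

A superlinear weight rewards a single 6-letter dictionary hit more than three 2-letter hits, which helps because short words match by chance far more often.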

maraca asked Apr 18 '15 21:04


1 Answer

In my opinion your questions arise from too general a concept. It's impossible to calculate a cost function if you don't make the algorithm precise. I can propose an approach to make the second point of your concept more concrete:

  1. Calculate the expected values under randomness (for example: with 100,000 letters, a random triplet should occur about 5 times, since 100,000 / 27^3 ≈ 5)
  2. Let n be the number of letters in your ciphered text. Then for each position y, increment Letter[y], Pair[y][y+1] and Triplet[y][y+1][y+2]
  3. If the occurrence of some n-gram is significantly greater than the value calculated in step 1, then try to judge how close to the answer you are.

Still, step 3 and "judging" are very general, but based on this I can give you a few answers:

Questions about the cost function

  1. It's better to use only the most common words, because it gives you information about the deviation from random results. Holding all words gives you no benefit.
  2. Overall frequency is my recommendation. I can't find any use for holding the conditional probabilities.

Cost function

In my case the cost of the algorithm is O(n) + const (for long words you can consider using hash tables) + "judge". The problem remains, because much depends on how "judge" is solved.

  1. I don't know why you chose to calculate the cost function like that, but to me w1 = 27 * w2 = 27 * 27 * w3 sounds more reasonable, because it's less likely to see many occurrences above the average for longer n-grams.
  2. In my solution there is no need for, and no advantage in, using conditional probabilities.
  3. And this question is another big problem, which in my opinion has much in common with "Generate a (semi-)random solution". Let's say you have guessed the letters 't', 'h', 'e', 'y'. Your algorithm should detect the words 'the', 'them', 'they', but totally miss the words 'and', 'work', 'no', 'will'. You can use characteristics of words, like 'the' being a common prefix, or the 3rd and 4th letters of 'will' being the same, etc. This complicates the solution, but it should give better results.
Marcin J answered Oct 23 '22 04:10