Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Breaking Vigenere only knowing key length

Problem

I want to decode a message encrypted with classic Viginere. I know that the key has a length of exactly 6 characters.

The message is:

BYOIZRLAUMYXXPFLPWBZLMLQPBJMSCQOWVOIJPYPALXCWZLKXYVMKXEHLIILLYJMUGBVXBOIRUAVAEZAKBHXBDZQJLELZIKMKOWZPXBKOQALQOWKYIBKGNTCPAAKPWJHKIAPBHKBVTBULWJSOYWKAMLUOPLRQOWZLWRSLEHWABWBVXOLSKOIOFSZLQLYKMZXOBUSPRQVZQTXELOWYHPVXQGDEBWBARBCWZXYFAWAAMISWLPREPKULQLYQKHQBISKRXLOAUOIEHVIZOBKAHGMCZZMSSSLVPPQXUVAOIEHVZLTIPWLPRQOWIMJFYEIAMSLVQKWELDWCIEPEUVVBAZIUXBZKLPHKVKPLLXKJMWPFLVBLWPDGCSHIHQLVAKOWZSMCLXWYLFTSVKWELZMYWBSXKVYIKVWUSJVJMOIQOGCNLQVXBLWPHKAOIEHVIWTBHJMKSKAZMKEVVXBOITLVLPRDOGEOIOLQMZLXKDQUKBYWLBTLUZQTLLDKPLLXKZCUKRWGVOMPDGZKWXZANALBFOMYIXNGLZEKKVCYMKNLPLXBYJQIPBLNMUMKNGDLVQOWPLEOAZEOIKOWZZMJWDMZSRSMVJSSLJMKMQZWTMXLOAAOSTWABPJRSZMYJXJWPHHIVGSLHYFLPLVXFKWMXELXQYIFUZMYMKHTQSMQFLWYIXSAHLXEHLPPWIVNMHRAWJWAIZAAWUGLBDLWSPZAJSCYLOQALAYSEUXEBKNYSJIWQUKELJKYMQPUPLKOLOBVFBOWZHHSVUIAIZFFQJEIAZQUKPOWPHHRALMYIAAGPPQPLDNHFLBLPLVYBLVVQXUUIUFBHDEHCPHUGUM

Question

I tried a brute-force approach but unfortunately this yields an extreme amount of combinations, too many to compute.

Do you have any idea how to go from here or how to approach this problem in general?


Attempt

Here is what i have so far:

public class Main {
    // instance variables - replace the example below with your own
    private String message;
    private String answer;
    private String first;

    /**
     * Constructor for objects of class Main
     */
    public Main()
    {
        // initialise instance variables
        message ="BYOIZRLAUMYXXPFLPWBZLMLQPBJMSCQOWVOIJPYPALXCWZLKXYVMKXEHLIILLYJMUGBVXBOIRUAVAEZAKBHXBDZQJLELZIKMKOWZPXBKOQALQOWKYIBKGNTCPAAKPWJHKIAPBHKBVTBULWJSOYWKAMLUOPLRQOWZLWRSLEHWABWBVXOLSKOIOFSZLQLYKMZXOBUSPRQVZQTXELOWYHPVXQGDEBWBARBCWZXYFAWAAMISWLPREPKULQLYQKHQBISKRXLOAUOIEHVIZOBKAHGMCZZMSSSLVPPQXUVAOIEHVZLTIPWLPRQOWIMJFYEIAMSLVQKWELDWCIEPEUVVBAZIUXBZKLPHKVKPLLXKJMWPFLVBLWPDGCSHIHQLVAKOWZSMCLXWYLFTSVKWELZMYWBSXKVYIKVWUSJVJMOIQOGCNLQVXBLWPHKAOIEHVIWTBHJMKSKAZMKEVVXBOITLVLPRDOGEOIOLQMZLXKDQUKBYWLBTLUZQTLLDKPLLXKZCUKRWGVOMPDGZKWXZANALBFOMYIXNGLZEKKVCYMKNLPLXBYJQIPBLNMUMKNGDLVQOWPLEOAZEOIKOWZZMJWDMZSRSMVJSSLJMKMQZWTMXLOAAOSTWABPJRSZMYJXJWPHHIVGSLHYFLPLVXFKWMXELXQYIFUZMYMKHTQSMQFLWYIXSAHLXEHLPPWIVNMHRAWJWAIZAAWUGLBDLWSPZAJSCYLOQALAYSEUXEBKNYSJIWQUKELJKYMQPUPLKOLOBVFBOWZHHSVUIAIZFFQJEIAZQUKPOWPHHRALMYIAAGPPQPLDNHFLBLPLVYBLVVQXUUIUFBHDEHCPHUGUM";


        for (int x = 0; x < message.length() / 6; x++) {
            int index = x * 6;
            first = new StringBuilder()
                .append(first)
                .append(message.charAt(index))
                .toString();
        }
        System.out.println(first);
    }
}
like image 900
Ollie Jones Avatar asked Nov 28 '19 17:11

Ollie Jones


2 Answers

Non-text message

In case the raw message is not actual text (like english text that makes sense) or you have no information about its content, you will be out of luck.

Especially if the text is actually hashed or double-encrypted, i.e. random stuff.

Breaking an encryption scheme requires knowledge about the algorithm and the messages. Especially in your situation, you will need to know the general structure of your messages in order to break it.


Prerequisites

For the rest of this answer, let me assume your message is actually plain english text. Note that you can easily adopt my answer to other languages. Or even adopt the techniques to other message formats.

Let me also assume that you are talking about classic Vigenere (see Wikipedia) and not about one of its many variants. That means that your input consists only of the letters A to Z, no case, no interpunction, no spaces. Example:

MYNAMEISJOHN // Instead of: My name is John.

The same also applies to your key, it only contains A to Z.

Classic Viginere then shifts by the offset in the alphabet, modulo the alphabet size (which is 26).

Example:

(G + L) % 26 = R

Dictionary

Before we talk about attacks we need to find a way to, given a generated key, find out whether it is actually correct or not.

Since we know that the message consists of english text, we can just take a dictionary (a huge list of all valid english words) and compare our decrypted message against the dictionary. If the key was wrong, the resulting message will not contain valid words (or only a few).

This can be a bit tricky since we lack interpunction (in particular, no spaces).

N-grams

Good thing that there is actually a very accurate way of measuring how valid a text is, which also solves the issue with the missing interpunction.

The technique is called N-grams (see Wikipedia). You choose a value for N, for example 3 (then called tri-grams) and start splitting your text into pairs of 3 characters. Example:

MYNAMEISJOHN // results in the trigrams:
$$M, $$MY, MYN, YNA, NAM, AME, MEI, ISJ, SJO, JOH, OHN, HN$, N$$

What you need now is a frequency analysis of the most common tri-grams in english text. There exist various sources online (or you can run it yourself on a big text corpus).

Then you simply compare your tri-gram frequency to the frequency for real text. Using that, you compute a score of how well your frequency matches the real frequency. If your message contains a lot of very uncommon tri-grams, it is highly likely to be garbage data and not real text.

A small note, mono-grams (1-gram) result in a single character frequency (see Wikipedia#Letter frequency). Bi-grams (2-gram) are used commonly for cracking Viginere and yield good results.


Attacks

Brute-Force

The first and most straightforward attack is always brute-force. And, as long as the key and the alphabet is not that big, the amount of combinations is relatively low.

Your key has length 6, the alphabet has size 26. So the amount of different key combinations is 6^26, which is

170_581_728_179_578_208_256

So about 10^20. This number might appear huge, but do not forget that CPUs operate already in the Gigahertz range (10^9 operations per second, per core). That means that a single core with 1 GHz will have generated all solutions in about 317 years. Now replace that by a powerful CPU (or even GPU) and with a multi-core machine (there are clusters with millions of cores), then this is computed in less than a day.

But okay, I get that you most likely do not have access to such a hardcore cluster. So a full brute-force is not feasible.

But do not worry. There are simple tricks to speed this up. You do not have to compute the full key. How about limiting yourself to the first 3 characters instead of the full 6 characters. You will only be able to decrypt a subset of the text then, but it is enough to analyze whether the outcome is valid text or not (using dictionaries and N-grams, as mentioned before).

This small change already drastically cuts down computation time since you then only have 3^26 combinations. Generating those takes around 2 minutes for a single 1 GHz core.

But you can do even more. Some characters are extremely rare in english text, for example Z. You can simply start by not considering keys that would translate to those values in the text. Let us say you remove the 6 least common characters by that, then your combinations are only 3^20. This takes around 100 milliseconds for a single 1 GHz core. Yes, milliseconds. That is fast enough for your average laptop.

Frequency Attack

Enough brute-force, let us do something clever. A letter frequency attack is a very common attack against those encryption schemes. It is simple, extremely fast and very successful. In fact, it is so simple that there are quite some online tools that offer this for free, for example guballa.de/vigenere-solver (it is able to crack your specific example, I just tried it out).

While Viginere changes the message to unreadable garbage, it does not change the distribution of letters, at least not per digit of the key. So if you look at, let's say the second digit of your key, from there on, every sixth letter (length of the key) in the message will be shifted by the exact same offset.

Let us take a look at a simple example. The key is BAC and the message is

CCC CCC CCC CCC CCC // raw
DCF DCF DCF DCF DCF // decrypted

As you notice, the letters repeat. Looking at the third letter, it is always F. So that means that the sixth and ninth letter, which are also F, all must be the exact same original letter. Since they where all shifted by the C from the key.

That is a very important observation. It means that letter frequency is, within a multiple of a digit of the key (k * (i + key_length)), preserved.

Let us now take a look at the letter distribution in english text (from Wikipedia):

Letter frequency

All you have to do now is to split your message into its blocks (modulo key-length) and do a frequency analysis per digit of the blocks.

So for your specific input, this yields the blocks

BYOIZR
LAUMYX
XPFLPW
BZLMLQ
PBJMSC
...

Now you analyze the frequency for digit 1 of each block, then digit 2, and so on, until digit 6. For the first digit, this are the letters

B, L, X, B, P, ...

The result for your specific input is:

[B=0.150, E=0.107, X=0.093, L=0.079, Q=0.079, P=0.071, K=0.064, I=0.050, O=0.050, R=0.043, F=0.036, J=0.036, A=0.029, S=0.029, Y=0.021, Z=0.021, C=0.014, T=0.014, D=0.007, V=0.007]
[L=0.129, O=0.100, H=0.093, A=0.079, V=0.071, Y=0.071, B=0.057, K=0.057, U=0.050, F=0.043, P=0.043, S=0.043, Z=0.043, D=0.029, W=0.029, N=0.021, C=0.014, I=0.014, J=0.007, T=0.007]
[W=0.157, Z=0.093, K=0.079, L=0.079, V=0.079, A=0.071, G=0.071, J=0.064, O=0.050, X=0.050, D=0.043, U=0.043, S=0.036, Q=0.021, E=0.014, F=0.014, N=0.014, M=0.007, T=0.007, Y=0.007]
[M=0.150, P=0.100, Q=0.100, I=0.079, B=0.071, Z=0.071, L=0.064, W=0.064, K=0.057, V=0.043, E=0.036, A=0.029, C=0.029, N=0.029, U=0.021, H=0.014, S=0.014, D=0.007, G=0.007, J=0.007, T=0.007]
[L=0.136, Y=0.100, A=0.086, O=0.086, P=0.086, U=0.086, H=0.064, K=0.057, V=0.050, Z=0.050, S=0.043, J=0.029, M=0.021, T=0.021, W=0.021, G=0.014, I=0.014, B=0.007, C=0.007, N=0.007, R=0.007, X=0.007]
[I=0.129, M=0.107, X=0.100, L=0.086, W=0.079, S=0.064, R=0.057, H=0.050, Q=0.050, K=0.043, E=0.036, C=0.029, T=0.029, V=0.029, F=0.021, J=0.021, P=0.021, G=0.014, Y=0.014, A=0.007, D=0.007, O=0.007]

Look at it. You see that for the first digit the letter B is very common, 15%. And then letter E with 10% and so on. There is a high chance that letter B, for the first digit of the key, is an alias for E in the real text (since E is the most common letter in english text) and that the E stands for the second most common letter, namely T.

Using that you can easily reverse-compute the letter of the key used for encryption. It is obtained by

B - E % 26 = X

Note that your message distribution might not necessary align with the real distribution over all english text. Especially if the message is not that long (the longer, the more accurate is the distribution computation) or mainly consists of weird and unusual words.

You can counter that by trying out a few combinations among the highest of your distribution. So for the first digit you could try out whether

B -> E
E -> E
X -> E
L -> E

Or instead of mapping to E only, also try out the second most common character T:

B -> T
E -> T
X -> T
L -> T

The amount of combinations you get with that is very low. Use dictionaries and N-grams (as mentioned before) to validate whether the key is correct or not.

Java Implementation

Your message is actually very interesting. It perfectly aligns with the real letter frequency over english text. So for your particular case you actually do not need to try out any combinations, nor do you need to do any dictionary/n-gram checks. You can actually just translate the most common letter in your encrypted message (per digit) to the most common character in english text, E, and get the real actual key.

Since that is so simple and trivial, here is a full implementation in Java for what I explained before step by step, with some debug outputs (it is a quick prototype, not really nicely structured):

import java.util.*;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public final class CrackViginere {

    private static final int ALPHABET_SIZE = 26;
    private static final char FIRST_CHAR_IN_ALPHABET = 'A';

    public static void main(final String[] args) {
        String encrypted =
                "BYOIZRLAUMYXXPFLPWBZLMLQPBJMSCQOWVOIJPYPALXCWZLKXYVMKXEHLIILLYJMUGBVXBOIRUAVAEZAKBHXBDZQJLELZIKMKOWZPXBKOQALQOWKYIBKGNTCPAAKPWJHKIAPBHKBVTBULWJSOYWKAMLUOPLRQOWZLWRSLEHWABWBVXOLSKOIOFSZLQLYKMZXOBUSPRQVZQTXELOWYHPVXQGDEBWBARBCWZXYFAWAAMISWLPREPKULQLYQKHQBISKRXLOAUOIEHVIZOBKAHGMCZZMSSSLVPPQXUVAOIEHVZLTIPWLPRQOWIMJFYEIAMSLVQKWELDWCIEPEUVVBAZIUXBZKLPHKVKPLLXKJMWPFLVBLWPDGCSHIHQLVAKOWZSMCLXWYLFTSVKWELZMYWBSXKVYIKVWUSJVJMOIQOGCNLQVXBLWPHKAOIEHVIWTBHJMKSKAZMKEVVXBOITLVLPRDOGEOIOLQMZLXKDQUKBYWLBTLUZQTLLDKPLLXKZCUKRWGVOMPDGZKWXZANALBFOMYIXNGLZEKKVCYMKNLPLXBYJQIPBLNMUMKNGDLVQOWPLEOAZEOIKOWZZMJWDMZSRSMVJSSLJMKMQZWTMXLOAAOSTWABPJRSZMYJXJWPHHIVGSLHYFLPLVXFKWMXELXQYIFUZMYMKHTQSMQFLWYIXSAHLXEHLPPWIVNMHRAWJWAIZAAWUGLBDLWSPZAJSCYLOQALAYSEUXEBKNYSJIWQUKELJKYMQPUPLKOLOBVFBOWZHHSVUIAIZFFQJEIAZQUKPOWPHHRALMYIAAGPPQPLDNHFLBLPLVYBLVVQXUUIUFBHDEHCPHUGUM";
        int keyLength = 6;

        char mostCommonCharOverall = 'E';

        // Blocks
        List<String> blocks = new ArrayList<>();
        for (int startIndex = 0; startIndex < encrypted.length(); startIndex += keyLength) {
            int endIndex = Math.min(startIndex + keyLength, encrypted.length());
            String block = encrypted.substring(startIndex, endIndex);
            blocks.add(block);
        }

        System.out.println("Individual blocks are:");
        blocks.forEach(System.out::println);

        // Frequency
        List<Map<Character, Integer>> digitToCounts = Stream.generate(HashMap<Character, Integer>::new)
                .limit(keyLength)
                .collect(Collectors.toList());

        for (String block : blocks) {
            for (int i = 0; i < block.length(); i++) {
                char c = block.charAt(i);
                Map<Character, Integer> counts = digitToCounts.get(i);
                counts.compute(c, (character, count) -> count == null ? 1 : count + 1);
            }
        }

        List<List<CharacterFrequency>> digitToFrequencies = new ArrayList<>();
        for (Map<Character, Integer> counts : digitToCounts) {
            int totalCharacterCount = counts.values()
                    .stream()
                    .mapToInt(Integer::intValue)
                    .sum();
            List<CharacterFrequency> frequencies = new ArrayList<>();
            for (Map.Entry<Character, Integer> entry : counts.entrySet()) {
                double frequency = entry.getValue() / (double) totalCharacterCount;
                frequencies.add(new CharacterFrequency(entry.getKey(), frequency));
            }
            Collections.sort(frequencies);
            digitToFrequencies.add(frequencies);
        }

        System.out.println("Frequency distribution for each digit is:");
        digitToFrequencies.forEach(System.out::println);

        // Guessing
        StringBuilder keyBuilder = new StringBuilder();
        for (List<CharacterFrequency> frequencies : digitToFrequencies) {
            char mostFrequentChar = frequencies.get(0)
                    .getCharacter();
            int keyInt = mostFrequentChar - mostCommonCharOverall;
            keyInt = keyInt >= 0 ? keyInt : keyInt + ALPHABET_SIZE;
            char key = (char) (FIRST_CHAR_IN_ALPHABET + keyInt);
            keyBuilder.append(key);
        }

        String key = keyBuilder.toString();
        System.out.println("The guessed key is: " + key);

        System.out.println("Decrypted message:");
        System.out.println(decrypt(encrypted, key));
    }

    private static String decrypt(String encryptedMessage, String key) {
        StringBuilder decryptBuilder = new StringBuilder(encryptedMessage.length());
        int digit = 0;
        for (char encryptedChar : encryptedMessage.toCharArray())
        {
            char keyForDigit = key.charAt(digit);

            int decryptedCharInt = encryptedChar - keyForDigit;
            decryptedCharInt = decryptedCharInt >= 0 ? decryptedCharInt : decryptedCharInt + ALPHABET_SIZE;
            char decryptedChar = (char) (decryptedCharInt + FIRST_CHAR_IN_ALPHABET);
            decryptBuilder.append(decryptedChar);

            digit = (digit + 1) % key.length();
        }
        return decryptBuilder.toString();
    }

    private static class CharacterFrequency implements Comparable<CharacterFrequency> {
        private final char character;
        private final double frequency;

        private CharacterFrequency(final char character, final double frequency) {
            this.character = character;
            this.frequency = frequency;
        }

        @Override
        public int compareTo(final CharacterFrequency o) {
            return -1 * Double.compare(frequency, o.frequency);
        }

        private char getCharacter() {
            return character;
        }

        private double getFrequency() {
            return frequency;
        }

        @Override
        public String toString() {
            return character + "=" + String.format("%.3f", frequency);
        }
    }
}

Decrypted

Using above code, the key is:

XHSIHE

And the full decrypted message is:

ERWASNOTCERTAINDISESTEEMSURELYTHENHEMIGHTHAVEREGARDEDTHATABHORRENCEOFTHEUNINTACTSTATEWHICHHEHADINHERITEDWITHTHECREEDOFMYSTICISMASATLEASTOPENTOCORRECTIONWHENTHERESULTWASDUETOTREACHERYAREMORSESTRUCKINTOHIMTHEWORDSOFIZZHUETTNEVERQUITESTILLEDINHISMEMORYCAMEBACKTOHIMHEHADASKEDIZZIFSHELOVEDHIMANDSHEHADREPLIEDINTHEAFFIRMATIVEDIDSHELOVEHIMMORETHANTESSDIDNOSHEHADREPLIEDTESSWOULDLAYDOWNHERLIFEFORHIMANDSHEHERSELFCOULDDONOMOREHETHOUGHTOFTESSASSHEHADAPPEAREDONTHEDAYOFTHEWEDDINGHOWHEREYESHADLINGEREDUPONHIMHOWSHEHADHUNGUPONHISWORDSASIFTHEYWEREAGODSANDDURINGTHETERRIBLEEVENINGOVERTHEHEARTHWHENHERSIMPLESOULUNCOVEREDITSELFTOHISHOWPITIFULHERFACEHADLOOKEDBYTHERAYSOFTHEFIREINHERINABILITYTOREALIZETHATHISLOVEANDPROTECTIONCOULDPOSSIBLYBEWITHDRAWNTHUSFROMBEINGHERCRITICHEGREWTOBEHERADVOCATECYNICALTHINGSHEHADUTTEREDTOHIMSELFABOUTHERBUTNOMANCANBEALWAYSACYNI

Which is more or less valid english text:

er was not certain disesteem surely then he might have regarded that
abhorrence of the unintact state which he had inherited with the creed
of my sticismas at least open to correction when the result was due to
treachery are morse struck into him the words of izz huett never quite
still ed in his memory came back to him he had asked izz if she loved
him and she had replied in the affirmative did she love him more than
tess did no she had replied tess would lay down her life for him and she
herself could do no more he thought of tess as she had appeared on the day
of the wedding how here yes had lingered upon him how she had hung upon
his words as if they were a gods and during the terrible evening over
the hearth when her simple soul uncovered itself to his how pitiful her
face had looked by the rays of the fire inherinability to realize that
his love and protection could possibly be withdrawn thus from being her
critiche grew to be her advocate cynical things he had uttered to
himself about her but noman can be always acyn I

Which, by the way, is a quote from the british novel Tess of the d'Urbervilles: A Pure Woman Faithfully Presented. Phase the Sixth: The Convert, Chapter XLIX.

like image 159
Zabuzard Avatar answered Sep 25 '22 13:09

Zabuzard


Standard Vigenere interleaves Caesar shift cyphers, specified by the key. If the Vigenere key is six characters long, then letters 1, 7, 13, ... of the ciphertext are on one Caesar shift -- every sixth character uses the first character of the key. Letter 2, 8, 14 ... of the ciphertext use a different (in general) Caesar shift and so on.

That gives you six different Caesar shift ciphers to solve. The text will not be in English, due to picking every sixth letter, so you will need to solve it by letter frequency. That will give you a few good options for each position of the key. Try them in order of probability to see which gives the correct decryption.

like image 21
rossum Avatar answered Sep 23 '22 13:09

rossum