Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding words in a non-spaced paragraph?

I am using python to build a Caesar cipher decrypter, It works and decrypts the already encrypted word. However, it shows all its brute force decryption attempts, for example, "HELLO" encrypted with a key of 3 is KHOOR. The results after the decryption is "KHOORJGNNQIFMMPHELLOGDKKNFCJJMEBIILDAHHKCZGGJBYFFIAXEEHZWDDGYVCCFXUBBEWTAADVSZZCURYYBTQXXASPWWZROVVYQNUUXPMTTWOLSSVNKRRUMJQQTLIPPS" I am wondering if there is a way to use a dictionary with Python to search for an English word in this output or can I improve my code to only print out known English words. Apologies if this has been asked before, I searched around and couldn't seem to find the right thing.

like image 730
Andrew Stewart Avatar asked Mar 21 '23 10:03

Andrew Stewart


2 Answers

englishWords = ['HELLO', 'ME', 'AXE', 'FOO', 'BAR', 'BAZ'] #and many more
cypher = 'KHOORJGNNQIFMMPHELLOGDKKNFCJJMEBIILDAHHKCZGGJBYFFIAXEEHZWDDGYVCCFXUBBEWTAADVSZZCURYYBTQXXASPWWZROVVYQNUUXPMTTWOLSSVNKRRUMJQQTLIPPS'

for word in englishWords:
    if word not in cypher: continue
    print('Found "{}"'.format(word))

This yields:

Found "HELLO"
Found "ME"
Found "AXE"

If this is about seeing, if the key for decyphering a text is the correct one, i.e. if the result might be English words, I wouldn't look for words, but try to find clusters inside the result which do not comply with the English syllable apartus.


Here a very naive implementation scanning for letter frequencies:

#! /usr/bin/python3

plain = 'Z RD NFEUVIZEX ZW KYVIV ZJ R NRP KF LJV R UZTKZFERIP NZKY GPKYFE KF JVRITY WFI RE VEXCZJY NFIU ZE KYZJ FLKGLK FI TRE Z ZDGIFMV DP TFUV KF FECP GIZEK FLK BEFNE VEXCZJY NFIUJ. RGFCFXZVJ ZW KYZJ YRJ SVVE RJBVU SVWFIV, Z JVRITYVU RIFLEU REU TFLCUE\'K JVVD KF WZEU KYV IZXYK KYZEX.'.upper ()

freqs = {'E': 12.7, 'T': 9.1, 'A': 8.2, 'O': 7.5, 'I': 7.0}

def cypher(text, key):
    return ''.join(chr((ord(c) - ord('A') + key) % 26 + ord('A')) if 'A' <= c <= 'Z' else c for c in text)


def crack(text):
    length = len(text)
    best = 100000
    bestMatch = ''
    for key in range(26):
        cand = cypher(text, key)
        quality = 0
        for l, c in {letter: sum(1 for c in cand if c == letter) for letter in 'ETAOI'}.items():
            quality += (c / length - freqs[l]) ** 2
        if quality < best:
            best = quality
            bestMatch = cand
    return bestMatch

print(crack(plain))

Here three examples:

Input: TQ ESTD TD LMZFE DPPTYR, TQ ESP VPJ QZC OPNJASPCTYR L EPIE TD ESP NZCCPNE ZYP, T.P. TQ ESP CPDFWE XTRSE MP PYRWTDS HZCOD, T HZFWOY'E WZZV QZC HZCOD, MFE ECJ EZ QTYO NWFDEPCD TYDTOP ESP CPDFWE HSTNS OZ YZE NZXAWJ HTES ESP PYRWTDS DJWWLMWP LALCEFD.

Output: IF THIS IS ABOUT SEEING, IF THE KEY FOR DECYPHERING A TEXT IS THE CORRECT ONE, I.E. IF THE RESULT MIGHT BE ENGLISH WORDS, I WOULDN'T LOOK FOR WORDS, BUT TRY TO FIND CLUSTERS INSIDE THE RESULT WHICH DO NOT COMPLY WITH THE ENGLISH SYLLABLE APARTUS.

Input: KWSJUZAFY XGJ WFYDAKZ OGJVK AF S TDGUC GX MFVAXXWJWFLASLWV LWPL DACW LZSL AK UWJLSAFDQ HGKKATDW, SFV VGAFY AL WXXAUAWFLDQ AK S YWFMAFWDQ AFLWJWKLAFY HJGTDWE. TML AL'K HJGTDWESLAU XGJ DGLK GX JWSKGFK, BMKL GFW GX OZAUZ AK LZSL QGMJ WFUJQHLWV LWPL ESQ AFUDMVW LWPL LZSL JSFVGEDQ ZSHHWFK LG XGJE SF WFYDAKZ OGJV UGEHDWLWDQ TQ UZSFUW.

Output: SEARCHING FOR ENGLISH WORDS IN A BLOCK OF UNDIFFERENTIATED TEXT LIKE THAT IS CERTAINLY POSSIBLE, AND DOING IT EFFICIENTLY IS A GENUINELY INTERESTING PROBLEM. BUT IT'S PROBLEMATIC FOR LOTS OF REASONS, JUST ONE OF WHICH IS THAT YOUR ENCRYPTED TEXT MAY INCLUDE TEXT THAT RANDOMLY HAPPENS TO FORM AN ENGLISH WORD COMPLETELY BY CHANCE.

Input: QZC PILXAWP, UFDE ESP EPIE JZF'GP AZDEPO SPCP TYNWFOPD SPWW, TQ, WTA, WZR, LDA LYO ACZMLMWJ ZESPCD. JZF NZFWO ECTX OZHY ESP LWEPCYLETGPD MJ ZYWJ DPLCNSTYR QZC HZCOD ESP DLXP WPYRES LD JZFC ELCRPE HZCO, LYO ZYWJ QZC HZCOD HTES ESP DLXP WPEEPC ALEEPCY. MFE ESLE'D CPLWWJ BFTEP L WZE ZQ HZCV EZ RPE LCZFYO ESP QLNE ESLE JZFC TYTETLW ZFEAFE SLD L WZE ZQ FDPWPDD OLEL TY TE.

Output: FOR EXAMPLE, JUST THE TEXT YOU'VE POSTED HERE INCLUDES HELL, IF, LIP, LOG, ASP AND PROBABLY OTHERS. YOU COULD TRIM DOWN THE ALTERNATIVES BY ONLY SEARCHING FOR WORDS THE SAME LENGTH AS YOUR TARGET WORD, AND ONLY FOR WORDS WITH THE SAME LETTER PATTERN. BUT THAT'S REALLY QUITE A LOT OF WORK TO GET AROUND THE FACT THAT YOUR INITIAL OUTPUT HAS A LOT OF USELESS DATA IN IT.

And here a last example without spacing and punctuation:

Input: ZRDLJZEXGPKYFEKFSLZCURTRVJRITZGYVIUVTIPGKVIZKNFIBJREUUVTIPGKJKYVRCIVRUPVETIPGKVUNFIUYFNVMVIZKJYFNJRCCZKJ SILKVWFITVUVTIPGKZFERKKVDGKJWFIVORDGCVYVCCFVETIPGKVUNZKYRBVPFW3ZJBYFFI

Output: IAMUSINGPYTHONTOBUILDACAESARCIPHERDECRYPTERITWORKSANDDECRYPTSTHEALREADYENCRYPTEDWORDHOWEVERITSHOWSALLITS BRUTEFORCEDECRYPTIONATTEMPTSFOREXAMPLEHELLOENCRYPTEDWITHAKEYOF3ISKHOOR
like image 180
Hyperboreus Avatar answered Mar 24 '23 05:03

Hyperboreus


Searching for English words in a block of undifferentiated text like that is certainly possible, and doing it efficiently is a genuinely interesting problem. But it's problematic for lots of reasons, just one of which is that your encrypted text may include text that randomly happens to form an English word completely by chance.

For example, just the text you've posted here includes HELL, IF, LIP, LOG, ASP and probably others. You could trim down the alternatives by only searching for words the same length as your target word, and only for words with the same letter pattern. But that's really quite a lot of work to get around the fact that your initial output has a lot of useless data in it.

You can check easily to find out whether a specific word is in an English dictionary by doing this:

  1. Read in the lines from a dictionary file (/usr/share/dict/words on most systems).
  2. Strip whitespace, convert to lowercase and store each line in a Python dictionary.
  3. After decrypting each word, check to see if it's present as a key in the Python dictionary.

Taking that approach probably makes a lot more sense than trying to grovel through the unspaced initial output.

like image 36
Tim Pierce Avatar answered Mar 24 '23 05:03

Tim Pierce