Is there any good way to "encode" binary data as plausible made-up words and back again?

Tags:

encoding

To give you a very simple and bad example: the data is split into 4-bit groups. The 16 possible values correspond to the first 16 consonants. You add a random vowel to make it pronounceable. So "08F734F7" can become "ba lo ta ku fo go ta ka". You can join some syllables and add punctuation and capitalization, and it becomes "Balo ta kufogo, Taka?", which looks like a plausible language.
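For concreteness, here is that toy scheme as a short Python sketch (the consonant ordering and vowel choice are arbitrary; the vowels carry no information, so decoding reads only the consonants):

import random

CONSONANTS = "bcdfghjklmnpqrst"   # the first 16 consonants, one per nibble value
VOWELS = "aeiou"

def to_syllables(data: bytes) -> str:
    syllables = []
    for byte in data:
        for nibble in (byte >> 4, byte & 0xF):
            syllables.append(CONSONANTS[nibble] + random.choice(VOWELS))
    return " ".join(syllables)

def from_syllables(text: str) -> bytes:
    nibbles = [CONSONANTS.index(s[0]) for s in text.split()]
    return bytes(hi << 4 | lo for hi, lo in zip(nibbles[::2], nibbles[1::2]))

print(to_syllables(bytes.fromhex("08F734F7")))  # e.g. "ba lo ta ku fo go ta ka"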

Just to make it clear, I'm not trying to protect the binary data.

I want to use this after I compress and encrypt my (UTF-8) plain text diary. The resulting binary data should look pretty random. I need to convert this data into a plausible-looking language and be able to convert it back. I'm going to print the "language" on paper and make a custom book.

So what I'm looking for is the best method of converting random data into readable, plausible words. By "good" I mean the highest bits-to-letters ratio (while still making it look like a real language). In my example it's exactly 2 bits per letter, i.e. 4 letters per byte.

Andrei Bulgaru, asked Jan 20 '11


3 Answers

FASCINATING question!

My best solution so far encodes 12 bits in 2 to 4 characters, giving 3 to 6 bits per letter. (Friday is not a good day to do the necessary maths on the uneven distribution of word lengths, so I haven't worked out the average bits per letter).

The idea is to work with "phonemes" that start with one or two consonants and end with one or two vowels. There are 21 consonants, and I feel that each one could be followed by an h, l, r, w or y and still look reasonable. So your phoneme starts with one of 126 consonant parts - b, bh, bl, br, bw, by, c, ch, cl, cr, ..., z, zh, zl, zr, zw, zy (admittedly, things like yy and zl look a bit weird, but it is a foreign language after all :) )

126 is so tantalisingly close to 128 that we could add t' and b' (for example) for the last two values - giving us a dictionary of 128 values, storing 7 bits. You could even replace yy with d' and zl with p', or whatever.

Similarly, the vowel portion could be a single vowel or a pair of vowels. I have eliminated aa, ii and uu because they look too weird to me (personal preference), even though they do occur in some real words (who decided "continuum" should be spelt that way, anyway?). So that gives 27 possible vowel parts: a, e, i, o, u, ae, ai, ao, ..., ue, ui, uo.

27 is close to 32, so throw in 5 more values using accented vowels (é, â, etc.). That brings the total to 32, giving us 5 bits plus the added benefit of some sparse accenting.

So that's 12 bits in 2, 3 or 4 letters.

For more fun, if the next bit is a 1, insert a space 90% of the time (at random) or a punctuation mark the other 10%; if the next bit is a 0, don't insert anything - just start the next phoneme. Capitalise the first letter after punctuation.

That should give you something like:

Bwaijou t'ei plo ku bhaproti! Llanoi proimlaroo jaévli.
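Here is a rough Python sketch of that scheme (my own arbitrary choices: the two padding onsets t' and b', the five accented vowels, and simply ignoring the last few bits that don't fill a whole phoneme - a real version would pad; capitalisation after punctuation is also omitted):

import random

CONSONANTS = "bcdfghjklmnpqrstvwxyz"          # 21 consonants

# 21 * 6 = 126 consonant parts, padded to 128 (7 bits) with t' and b'
ONSETS = [c + tail for c in CONSONANTS for tail in ("", "h", "l", "r", "w", "y")]
ONSETS += ["t'", "b'"]

# 5 single vowels + 25 pairs - aa/ii/uu = 27, padded to 32 (5 bits) with accents
VOWELS = "aeiou"
NUCLEI = list(VOWELS) + [a + b for a in VOWELS for b in VOWELS
                         if a + b not in ("aa", "ii", "uu")]
NUCLEI += ["á", "é", "í", "ó", "ú"]

def encode(data: bytes) -> str:
    bits = "".join(f"{byte:08b}" for byte in data)
    out, i = [], 0
    while i + 13 <= len(bits):                # 7 + 5 bits per phoneme + 1 separator bit
        out.append(ONSETS[int(bits[i:i + 7], 2)] + NUCLEI[int(bits[i + 7:i + 12], 2)])
        if bits[i + 12] == "1":               # separator: space 90%, punctuation 10%
            out.append(" " if random.random() < 0.9 else ". ")
        i += 13
    return "".join(out)

print(encode(bytes.fromhex("08F734F7")))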

Maybe someone can take it further.

Gavi Lock, answered Oct 19 '22


Summary


This solution lets you use any of a large number of languages, including pronounceable nonsense, with customizable efficiency. You can even create something that looks like grammatically correct but meaningless English or French (or worse, something that shifts between the two like a drunken polyglot). The basic idea is to use your data to repeatedly select productions from a context-free grammar until you run out of data.

Details


Add a string to the end of your input that doesn't occur anywhere inside it ("This is the end of my input, thank you very much" would be very unlikely to occur in a string of encrypted text, for example). You can do this without the marker string, but it makes things easier.

Treat your input as one very long integer, encoded low-bit first. Your machine can't hold such a big integer at once, so every time the high byte reaches zero, strip the next byte's worth of values off your file and multiply them in.
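In a language with arbitrary-precision integers the streaming isn't even necessary; a minimal Python sketch of both directions, using the suggested sentinel:

SENTINEL = b"This is the end of my input, thank you very much"

def to_int(data: bytes) -> int:
    # append the sentinel, then read everything as one little-endian integer
    return int.from_bytes(data + SENTINEL, "little")

def to_data(n: int) -> bytes:
    raw = n.to_bytes((n.bit_length() + 7) // 8, "little")
    return raw[: raw.rindex(SENTINEL)]        # strip the sentinel again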

Create your language as a context-free grammar. To avoid forgetting what the encoding is, you can print the grammar at the end of your book. Avoid ambiguity: if your grammar is ambiguous, you won't be able to decode. This is not hard - essentially, don't use the same terminal in two places, ensure that the concatenation of two terminals cannot form another terminal, and ensure that when reading the output you can tell where the formatting symbols went.

Now, to take an integer and turn it into language, use the following pseudo-code that uses n to choose which production to take.

cur = grammar.root          // cur is a list of tokens
n = my input as one big integer
while(n > 0 || cur != grammar.root){
    if (cur.first.isTerminalSymbol) {
        output cur.first
        cur.pop_first
        if(cur.isEmpty){
            cur = grammar.root      // start the next sentence
        }
    }else{
        sym = cur.pop_first         // pop the non-terminal before pushing its expansion
        p = grammar[sym].number of productions
        t = n mod p                 // t = low order digit base p
        n = n div p                 // shift right in base p
        cur.push_first( grammar[sym].productionNumber[t] )
    }
}

To decode, use a standard parser generator such as GNU Bison; its conflict warnings will also help you avoid creating an ambiguous grammar.

Run the parser on the text and start n at 0. The syntax tree records which production was chosen at each step; take those choices in the reverse of encoding order (the encoder peeled off low-order digits first) and, for each one, multiply n by the number of productions of that non-terminal and add the number of the production taken. Finally write n out as bytes, low byte first, and cut it off at your unique end phrase. (A runnable sketch of both directions follows Example 1 below.)

Example 1


This will be clearer with an example or three.

My example language is as follows (non-terminals are capitalized). Note that because the terminals are long relative to the number of choices at each node, it is not very efficient, but more (or shorter) terminals can give you any efficiency you want, up to the n bits per character that the character encoding itself spends.

  • S -> __capital Noun I-Verb Punct | __capital Noun T-Verb Noun Punct
  • Noun -> joe | sally | spot | the car | the cookie | the hose
  • I-Verb -> runs | lives | jumps | flies
  • T-Verb -> jumps over | eats | grows | spins
  • Punct -> . | ! | ?

You could just as easily do this with syllables as an expansion of verbs and nouns. You could also include noun-phrases and verb phrases to have adjectives etc. in your language. You will probably also want paragraph and chapter symbols that break down into appropriate subunits with formatting. The number of alternate choices at a certain level of the tree determines the average number of bits encoded by each symbol. __capital is an example of a formatting symbol that, on output, capitalizes the next word.

So, imagine that my input becomes the number 77. Then I would encode it as follows:

S goes to two things. 77 % 2 = 1. Quotient: 77 / 2 = 38.

Now our number is 38 and we are expanding __capital, Noun, T-Verb, Noun, Punct

First word is __capital which is a terminal symbol. Output __capital (setting the print routine to capitalize the next word).

Now expanding Noun. Noun has 6 options. 38 % 6 = 2. 38 / 6 = 6. We choose spot

Now expanding spot, T-Verb, Noun, Punct. "spot" is terminal. Output spot. The printer, being in capital mode, writes "Spot" to the output file.

Now expanding T-Verb. Our number is 6. T-Verb has 4 options. 6 % 4 = 2. 6 / 4 = 1. So we choose "grows". In the next step we output grows to our file since it is a terminal.

Now expanding Noun, Punct. Noun has 6 options. Our number is 1. 1 % 6 = 1, 1 / 6 = 0. So we choose "sally", which is output on the next step.

Finally we are expanding Punct, which has 3 options. Our number is 0 (and would stay that way forever - this is why you append an end-of-text string to your input; otherwise the decoding would end in an infinite string of zeroes). We choose ".", which is output.

Now the current string to expand is empty so we set it back to the root "S". But since n is 0, the algorithm terminates.

Thus 77 has become "Spot grows sally."
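A runnable Python sketch of the whole round trip for this toy grammar (the dict layout and helper names are mine; the decoder here cheats by working on the token stream with a hand-rolled lookahead instead of parsing the rendered text with a real parser generator):

GRAMMAR = {
    "S": [["__capital", "Noun", "I-Verb", "Punct"],
          ["__capital", "Noun", "T-Verb", "Noun", "Punct"]],
    "Noun": [["joe"], ["sally"], ["spot"], ["the car"], ["the cookie"], ["the hose"]],
    "I-Verb": [["runs"], ["lives"], ["jumps"], ["flies"]],
    "T-Verb": [["jumps over"], ["eats"], ["grows"], ["spins"]],
    "Punct": [["."], ["!"], ["?"]],
}

def encode(n, grammar):
    out, cur = [], ["S"]
    while n > 0 or cur != ["S"]:
        sym = cur.pop(0)
        if sym not in grammar:                 # terminal: emit it
            out.append(sym)
            if not cur:
                cur = ["S"]                    # start the next sentence
        else:                                  # non-terminal: n picks the production
            n, t = divmod(n, len(grammar[sym]))
            cur = list(grammar[sym][t]) + cur
    return out

def render(tokens):
    # tiny print routine: __capital capitalizes the next word,
    # punctuation attaches to the preceding word
    words, cap = [], False
    for tok in tokens:
        if tok == "__capital":
            cap = True
        elif tok in ".!?":
            words[-1] += tok
        else:
            words.append(tok.capitalize() if cap else tok)
            cap = False
    return " ".join(words)

IVERBS = {p[0] for p in GRAMMAR["I-Verb"]}

def decode(tokens):
    # re-derive the production choices, then fold them back into n,
    # last choice first (the encoder peeled off low-order digits first)
    choices, cur, i = [], ["S"], 0
    while i < len(tokens):
        sym = cur.pop(0) if cur else "S"
        if sym not in GRAMMAR:                 # terminal: consume the matching token
            i += 1
            continue
        prods = GRAMMAR[sym]
        if sym == "S":                         # the two S rules diverge at the verb,
            t = 0 if tokens[i + 2] in IVERBS else 1    # two tokens ahead
        else:                                  # every other rule is a single terminal
            t = [p[0] for p in prods].index(tokens[i])
        choices.append((t, len(prods)))
        cur = list(prods[t]) + cur
    n = 0
    for t, p in reversed(choices):
        n = n * p + t
    return n

tokens = encode(77, GRAMMAR)
print(render(tokens))                          # Spot grows sally.
print(decode(tokens))                          # 77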

Example 2


Things get more efficient if I replace the word terminals with syllable-building rules:

  • I-Verb -> IVS _space | IVS I-Verb
  • IVS -> IVSS Vowel
  • IVSS -> w | r
  • Vowel -> a | e | i | o | u | y
  • T-Verb -> TVS _space | TVS T-Verb
  • TVS -> TVSS Vowel
  • TVSS -> p | s
  • Noun -> NS _space | NS
  • NS -> NSS Vowel
  • NSS -> j | v
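The encode() sketch from Example 1 runs unchanged on a dict version of this grammar; only the print routine differs, since terminals are now single letters with explicit spacing (again my own naming, and the output shown assumes the production orderings used in Example 3 below):

COMPACT = {
    "S": GRAMMAR["S"],
    "Punct": GRAMMAR["Punct"],
    "I-Verb": [["IVS", "_space"], ["IVS", "I-Verb"]],
    "IVS": [["IVSS", "Vowel"]],
    "IVSS": [["w"], ["r"]],
    "Vowel": [["a"], ["e"], ["i"], ["o"], ["u"], ["y"]],
    "T-Verb": [["TVS", "_space"], ["TVS", "T-Verb"]],
    "TVS": [["TVSS", "Vowel"]],
    "TVSS": [["p"], ["s"]],
    "Noun": [["NS", "_space"], ["NS"]],
    "NS": [["NSS", "Vowel"]],
    "NSS": [["j"], ["v"]],
}

def render_letters(tokens):
    # letters concatenate directly; _space separates words, __capital
    # uppercases the next letter, spaces before punctuation are dropped
    text, cap = "", False
    for tok in tokens:
        if tok == "__capital":
            cap = True
        elif tok == "_space":
            text += " "
        elif tok in ".!?":
            text = text.rstrip() + tok + " "
        else:
            text += tok.upper() if cap else tok
            cap = False
    return text.strip()

print(render_letters(encode(77, COMPACT)))     # Vo papa ja.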

77 yields "Vo papa ja." under this encoding (the information is really carried by just the "Vo " and by the fact that papa has 2 syllables; the rest is the trailing zeroes of n, a very small overhead in any book-length file).

Example 3


Your example "08F734F7" would be "1000111101110011010011110111" in binary, which reads "1110111100101100111011110001" when reversed (low bit first), and that is 250793713 in decimal.

If I run that through the more compact grammar, I get:

250793713 % 2 = 1 n=125396856, S-> __capital Noun T-Verb Noun Punct
125396856 % 2 = 0 n=62698428,  Noun->NS _space-> NSS Vowel _space
62698428 % 2 = 0  n=31349214,  NSS->j
31349214 % 6 = 0  n=5224869,   Vowel->a
5224869 % 2 = 1   n=2612434,   T-Verb->TVS T-Verb->TVSS Vowel T-Verb
2612434 % 2 = 0   n=1306217,   TVSS->p
1306217 % 6 = 5   n=217702,    Vowel->y
217702 % 2 = 0    n=108851,    T-Verb->TVSS Vowel _space
108851 % 2 = 1    n=54425,     TVSS->s
54425 % 6 = 5     n=9070,      Vowel->y
9070 % 2 = 0      n=4535,      Noun->NSS Vowel _space
4535 % 2 = 1      n=2267       NSS->v
2267 % 6 = 5      n=377        Vowel->y
377 % 3 = 2       n=125        Punct->?
125 % 2 = 1       n=62         S->__capital Noun T-Verb Noun Punct
62 % 2 = 0        n=31         Noun->NSS Vowel _space
31 % 2 = 1        n=15         NSS->v
15 % 6 = 3        n=2          Vowel->o
2 % 2 = 0         n=1          T-Verb->TVSS Vowel _space
1 % 2 = 1         n=0          TVSS->s
                  n=0          Vowel _space Noun Punct -> "a ja." 

This yields: "Ja pysy vy? Vo sa ja." from 08F734F7 (note that my print routine removes spaces before punctuation).

Eponymous, answered Oct 19 '22


This is an old question, but very interesting.

Once I wanted to do a similar conversion, but with different goals. GUIDs (UUIDs) are usually not eye-friendly, so I had to convert them to plausible words. The final system was based on the occurrence of an English letter after the two preceding ones. The table was built from a corpus of English sentences, and combinations that occurred too rarely were excluded. So the final table contained lines looking like

  ...
  (key:'_t'; next:'aehioruwy'),
  (key:'_u'; next:'lmnprst'),
  (key:'_w'; next:'aehiory'),
  (key:'ab'; next:'abeilorsuwy'),
  (key:'ac'; next:'_acehikloqrtuy'),
  ...

containing about 200-300 lines, where 'next' lists all the letters that can appear after the 'key' pair (_ marks the beginning or end of a word, depending on whether it appears in the key or in next).

The conversion takes the current value, divides it by length(next), takes the remainder as the index of the next 'plausible' letter, and the quotient becomes the new current value. To avoid long words there was a trick, applied symmetrically by the encoder and decoder, to explicitly end words. This system could produce, for example, sequences like these (the input for each is a 128-bit GUID/UUID):

Furepas_Wann_Hunkare_Rylacid_Makinuag_Dreem

Togo_Ragam_Omb_Bonsbe_Gonn_Eclecki_Op

or, if we take a widely used GUID, for example MS IWebBrowser2 {D30C1661-CDAF-11D0-8A3E-00C04FC9E26E}:

Lakar_Rupplex_Waylagit_Munghim_Paddato_Molu 

("Lakar Rupplex" is a good human name for a browser, isn't it?)

As for density, this system gave about 3 bits per letter.
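A minimal Python sketch of that loop (the table here is just the excerpt above plus an assumed "__" entry for word-initial letters; the names and the context-window handling are my own reading of the scheme):

TABLE = {
    "__": "bcdfghjklmnprstvw",        # assumed: letters that may start a word
    "_t": "aehioruwy",
    "_u": "lmnprst",
    "_w": "aehiory",
    "ab": "abeilorsuwy",
    "ac": "_acehikloqrtuy",
    # ... a couple hundred more entries in the real table
}

def int_to_words(n, table):
    out, prev = [], "__"                       # context = the two preceding symbols
    while n > 0:
        alphabet = table[prev]                 # letters plausible after this context
        n, i = divmod(n, len(alphabet))        # remainder picks a letter, quotient carries on
        ch = alphabet[i]
        out.append(ch)
        prev = "__" if ch == "_" else (prev + ch)[-2:]   # slide or reset the window
    return "".join(out).strip("_").title()     # '_' separates words, as in the examples

# with a complete table this would run on e.g. int(uuid.uuid4()); the toy
# table above will hit a missing key (KeyError) almost immediately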

Maksee, answered Oct 19 '22