Python, can someone guess the type of a file only by its base64 encoding?

Tags:

base64

Let's say I have the following:

image_data = """iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg=="""

This is just a dot image (from https://en.wikipedia.org/wiki/Data_URI_scheme). But suppose I do not know whether it is an image, text, etc. Is it possible to tell what it is from the encoded string alone? I am trying this in Python, but it is also a general question, so any insight on either is highly welcome.


asked Dec 15 '15 11:12

george

2 Answers

You can't, at least not without decoding, because the bytes that help identify the filetype are spread across the base64 characters, which don't directly align with whole bytes. Each character encodes 6 bits, which means that for every 4 characters, there are 3 bytes encoded.

Identifying a filetype requires access to those bytes in different block sizes. A JPEG image, for example, can be identified from the bytes FF D8 at its start (or FF D9 at its end), but that's two bytes; the third byte that follows must also be decoded as part of the 4-character block.

What you can do is decode just enough of the base64 string to do your filetype fingerprinting. So you can decode the first 4 characters to get the 3 bytes, and then use the first two to see if the object is a JPEG image. A large number of file formats can be identified from just the first or last series of bytes (a PNG image can be identified by the first 8 bytes, a GIF by the first 6, etc.). Decoding just those bytes from the base64 string is trivial.
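As a minimal sketch of that idea, assuming Python 3: the helper name `sniff_base64` and the small `SIGNATURES` table below are illustrative only, not part of any library, and the input is assumed to be at least as long as the prefix being decoded.

```python
import base64

# Illustrative signature table: leading magic bytes -> label.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),   # 8 bytes
    (b"GIF87a", "gif"),              # 6 bytes
    (b"GIF89a", "gif"),
    (b"\xff\xd8", "jpeg"),           # 2 bytes
]

def sniff_base64(b64_text, nbytes=8):
    """Decode only enough base64 characters to cover `nbytes` bytes."""
    nchars = -(-nbytes // 3) * 4     # round up to whole 4-character groups
    head = base64.b64decode(b64_text[:nchars])
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return None

image_data = "iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg=="
print(sniff_base64(image_data))  # png
```

Only the first 12 characters are decoded here, no matter how long the string is.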

Your sample is a PNG image; you can test for image types using the imghdr module (deprecated in Python 3.11 and removed in 3.13):

>>> import base64
>>> import imghdr
>>> image_data = """iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg=="""
>>> sample = base64.b64decode(image_data[:44])  # 33 bytes / 3 times 4 is 44 base64 chars
>>> for tf in imghdr.tests:
...     res = tf(sample, None)
...     if res:
...         break
...
>>> print(res)
png

I only used the first 33 bytes from the base64 data, to echo what the imghdr.what() function will read from the file you pass it (it reads 32 bytes, but that number doesn't divide by 3).

There is an equivalent sndhdr module for sound files, and there is also the python-magic project that lets you pass in a number of bytes to determine a file type.


answered Oct 09 '22 00:10

Martijn Pieters

Of course you can. There are a few easy approaches to the problem that I can think of:

Partial decode

Each base64 character encodes 6 bits of input, so you can relate them as follows:

Base64: AAAAAABBBBBBCCCCCCDDDDDDEEEEEEFFFFFFGGGGGGHHHHHH
Data:   xxxxxxxxyyyyyyyyzzzzzzzzqqqqqqqqwwwwwwwweeeeeeee

If you would like to extract 4 bytes of data, starting with offset 1, like this:

                ................................
Base64: AAAAAABBBBBBCCCCCCDDDDDDEEEEEEFFFFFFGGGGGGHHHHHH
Data:   xxxxxxxxyyyyyyyyzzzzzzzzqqqqqqqqwwwwwwwweeeeeeee
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Then, to decode only the parts that you want, you need to know the bit distances. They are easy to calculate: just multiply your byte distances by 8. Once you know that you want 32 bits, starting at bit 8, you can find which base64 characters contain your starting and ending bits. To do that, divmod your offset and offset+length by 6:

start = bit  8 = char 1 + bit 2
end   = bit 40 = char 6 + bit 4

Well, this maps to the scheme above — your span starts after 1 full base64 char and 2 bits, and ends after 6 full base64 chars and 4 bits.
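The same arithmetic as a short Python sketch (the function name `bit_span` is just for illustration):

```python
def bit_span(byte_offset, byte_length):
    """Return ((start_char, start_bit), (end_char, end_bit)) for a byte span."""
    start_bit = byte_offset * 8
    end_bit = (byte_offset + byte_length) * 8
    return divmod(start_bit, 6), divmod(end_bit, 6)

print(bit_span(1, 4))  # ((1, 2), (6, 4))
```

Because the offsets are whole bytes, the bit remainder can only ever be 0, 2 or 4.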

Now that you know exactly which base64 chars you want, you need to decode them. It makes sense to leverage an existing base64 decoder, so you won't need to deal with the base64 encoding yourself. For that, you should know that every 4 chars of base64 code correspond to 3 bytes of data. So, here goes the trick: you can prepend and append gibberish to your extracted base64 code until the base64 and byte boundaries align, and, knowing how many invalid bytes the decoder will produce, throw out the excess.

So, how much to prepend depends on the value of the start bit remainder. If the start bit remainder is 0, it means that A and x are aligned, so no changes are required:

           |==========================
Base64: ...AAAAAABBBBBBCCCCCCDDDDDD...
Data:   ...xxxxxxxxyyyyyyyyzzzzzzzz...
           |==========================

If bit remainder is 2, you need to prepend one base64 char, and throw out one leading byte after decoding:

                 ##|==================
Base64: ...AAAAAABBBBBBCCCCCCDDDDDD...
Data:   ...xxxxxxxxyyyyyyyyzzzzzzzz...
                   |==================

If bit remainder is 4, you need to prepend two base64 chars, and throw out two leading bytes after decoding:

                       ####|==========
Base64: ...AAAAAABBBBBBCCCCCCDDDDDD...
Data:   ...xxxxxxxxyyyyyyyyzzzzzzzz...
                           |==========

Same goes for trailing. If end bit remainder is zero, no changes:

        ===|
Base64: ...AAAAAABBBBBBCCCCCCDDDDDD...
Data:   ...xxxxxxxxyyyyyyyyzzzzzzzz...
        ===|

If end bit remainder is 2, you need to append two base64 chars, and throw out two trailing bytes:

        =========##|
Base64: ...AAAAAABBBBBBCCCCCCDDDDDD...
Data:   ...xxxxxxxxyyyyyyyyzzzzzzzz...
        ===========|

If end bit remainder is 4, you need to append one base64 char, and throw out one trailing byte:

        ===============####|
Base64: ...AAAAAABBBBBBCCCCCCDDDDDD...
Data:   ...xxxxxxxxyyyyyyyyzzzzzzzz...
        ===================|

So, for synthetic example above, one character needs to be prepended (instead of A), and one character appended (in place of H):

                ................................
Base64: ??????BBBBBBCCCCCCDDDDDDEEEEEEFFFFFFGGGGGG??????
Data:   ????????yyyyyyyyzzzzzzzzqqqqqqqqwwwwwwww????????
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Now, after decoding, throw out extra bytes from head and tail and you're done.
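Put together, the whole procedure might look like the sketch below. The function name `decode_span` is made up for this example, `"A"` is an arbitrary choice of filler character, and the requested span is assumed to lie inside the encoded data.

```python
import base64

def decode_span(b64_text, byte_offset, byte_length):
    """Decode bytes [byte_offset, byte_offset + byte_length) only."""
    start_char, start_rem = divmod(byte_offset * 8, 6)
    end_char, end_rem = divmod((byte_offset + byte_length) * 8, 6)

    lead = start_rem // 2              # remainder 0/2/4 -> prepend 0/1/2 chars
    trail = (3 - end_rem // 2) % 3     # remainder 0/2/4 -> append 0/2/1 chars

    # Include the partly used final character when the end is not aligned.
    last_char = end_char + (1 if end_rem else 0)
    chunk = "A" * lead + b64_text[start_char:last_char] + "A" * trail

    decoded = base64.b64decode(chunk)
    # Each prepended/appended filler char costs one garbage byte on that side.
    return decoded[lead:len(decoded) - trail]

encoded = base64.b64encode(b"xyzqwe").decode()
print(decode_span(encoded, 1, 4))  # b'yzqw'
```

The six input bytes mirror the x, y, z, q, w, e columns of the diagram, so the result is the four bytes marked there.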

Practical example

Imagine you have a magic like ?PNG\r\n??????IHDR. Then, to check if base64-coded string matches your magic, you can identify bytes in magic that are known, and their bit offsets and lengths:

"PNG\r\n"  ->  offset =  8, length = 40
"IHDR"     ->  offset = 96, length = 32

So, using our ideas from above:

"PNG\r\n"  ->  start =  8 ( char  1, bits = 2 ), end = 48  ( char 8, bits = 0 )
"IHDR"     ->  start = 96 ( char 16, bits = 0 ), end = 128 ( char 21, bits = 2 )

To decode "PNG\r\n" part, you need to take 7 full base64 chars, starting with char 1, then prepend 1 char, decode, throw out 1 leading byte and compare.

To decode "IHDR" part, you need to take 6 base64 chars, starting with char 16, then append 2 chars, decode, throw out 2 trailing bytes and compare.
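Applied literally to the sample string from the question, those two recipes could be written as follows (a sketch; `"A"` is just an arbitrary filler character):

```python
import base64

image_data = "iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg=="

# "PNG\r\n": chars 1..7, prepend 1 filler char, drop 1 leading byte.
png_part = base64.b64decode("A" + image_data[1:8])[1:]

# "IHDR": chars 16..21, append 2 filler chars, drop 2 trailing bytes.
ihdr_part = base64.b64decode(image_data[16:22] + "AA")[:-2]

print(png_part == b"PNG\r\n" and ihdr_part == b"IHDR")  # True
```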

Translate magics

An alternative to the approach described above is, instead of translating the data, to translate the magics themselves.

So, if you have the magic ?PNG\r\n??????IHDR from the example above (\r and \n are shown as r and n below for presentation purposes), then encoded to base64 it looks like this:

Data:   [?PN]  [Grn]  [???]  [???]  [IHD]  [R??]
Base64: (?~BO) (Rw0K) (????) (????) (SUhE) (Ug==)

In ?~BO part, ~ sign is only partially random. Let's look at that construct bitwise:

Data:   ????????PPPPPPPPNNNNNNNN
Base64: ??????~~~~~~BBBBBBOOOOOO

So, only the two upper bits of ~ are truly unknown (they are the two lowest bits of the unknown byte), while its lower four bits come from P. You can use that information while testing the magic against the data, to narrow the scope of the magic.

For this particular case, here is exhaustive list of all encodings:

Data:   ??????00PPPPPPPPNNNNNNNN
Base64: ??????FFFFFFBBBBBBOOOOOO  => ?FBO

Data:   ??????01PPPPPPPPNNNNNNNN
Base64: ??????VVVVVVBBBBBBOOOOOO  => ?VBO

Data:   ??????10PPPPPPPPNNNNNNNN
Base64: ??????llllllBBBBBBOOOOOO  => ?lBO

Data:   ??????11PPPPPPPPNNNNNNNN
Base64: ??????111111BBBBBBOOOOOO  => ?1BO

Same applies to trailing R?? group, but because there are 4 undefined bits instead of 2, permutation list is longer:

Ug??  <=  0000???? ????????
Uh??  <=  0001???? ????????
Ui??  <=  0010???? ????????
Uj??  <=  0011???? ????????
Uk??  <=  0100???? ????????
Ul??  <=  0101???? ????????
Um??  <=  0110???? ????????
Un??  <=  0111???? ????????
Uo??  <=  1000???? ????????
Up??  <=  1001???? ????????
Uq??  <=  1010???? ????????
Ur??  <=  1011???? ????????
Us??  <=  1100???? ????????
Ut??  <=  1101???? ????????
Uu??  <=  1110???? ????????
Uv??  <=  1111???? ????????
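Both lists can be double-checked by brute force with the real encoder: vary only the unknown bits and look at which second character comes out. This is just a sanity-check sketch; the variable names are illustrative.

```python
import base64

# Leading group "?PN": vary the two lowest bits of the unknown byte.
leading = [base64.b64encode(bytes([low]) + b"PN").decode()[1]
           for low in range(4)]

# Trailing group "R??": vary the four highest bits of the byte after "R".
trailing = [base64.b64encode(b"R" + bytes([high << 4, 0])).decode()[1]
            for high in range(16)]

print("".join(leading))   # FVl1
print("".join(trailing))  # ghijklmnopqrstuv
```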

So, in regexp, your base64-magic for ?PNG\r\n??????IHDR would look like this:

import base64
import re

rx = re.compile(b'^.[FVl1]BORw0K........SUhEU[g-v]')
if rx.match(base64.b64encode(b'xPNG\r\n123456IHDR789foobar')):
    print('Yep, it works!')
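If you have many magics, the translation can be automated. Below is one possible sketch, assuming a magic is given as a list of byte values with `None` for unknown bytes; `magic_to_regex` is a hypothetical helper written for this answer, not a library function.

```python
import base64
import re
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def magic_to_regex(magic):
    """magic: list of ints (known bytes) or None (unknown bytes)."""
    bits = "".join("????????" if b is None else format(b, "08b") for b in magic)
    bits += "?" * (-len(bits) % 6)     # bits past the magic are unknown data
    parts = []
    for i in range(0, len(bits), 6):
        group = bits[i:i + 6]
        # Base64 chars whose 6-bit value agrees with every known bit.
        chars = [c for v, c in enumerate(ALPHABET)
                 if all(p in ("?", b) for p, b in zip(group, format(v, "06b")))]
        if len(chars) == 64:
            parts.append(".")
        elif len(chars) == 1:
            parts.append(re.escape(chars[0]))
        else:
            parts.append("[" + "".join(re.escape(c) for c in chars) + "]")
    return re.compile("^" + "".join(parts))

magic = [None] + list(b"PNG\r\n") + [None] * 6 + list(b"IHDR")
rx = magic_to_regex(magic)
print(rx.pattern)  # ^.[FVl1]BORw0K........SUhEU[ghijklmnopqrstuv]

image_data = "iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg=="
print(bool(rx.match(image_data)))  # True
```

This version works on str rather than bytes, and it spells the last character class out in full instead of writing it as the range g-v.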

answered Oct 09 '22 00:10

toriningen
