Let's say I have the following:
image_data = """iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg=="""
This is just a dot image (from https://en.wikipedia.org/wiki/Data_URI_scheme), but suppose I do not know whether it is an image, text, or something else. Is it possible to tell what the data is from the encoded string alone? I am trying this in Python, but it is also a general question, so any insight on either is highly welcome.
In base64 encoding, the character set is A-Z, a-z, 0-9, + and /. If the final group is shorter than 4 characters, it is padded with '=' characters. The pattern ^([A-Za-z0-9+/]{4})* means the string starts with zero or more 4-character base64 groups.
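As a minimal sketch of that pattern in Python: the regex below extends the 4-character group with an optional padded final group. The name looks_like_base64 is made up for illustration, and the check only covers the standard alphabet with padding (not the URL-safe variant).

```python
import re

# Zero or more full 4-character groups, optionally followed by one padded group.
BASE64_RE = re.compile(
    r"(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"
)

def looks_like_base64(s: str) -> bool:
    """Return True if s has the shape of standard, padded base64 (hypothetical helper)."""
    return BASE64_RE.fullmatch(s) is not None

print(looks_like_base64("iVBORw0KGgo="))  # True
print(looks_like_base64("not base64!"))   # False
```

A match only says the string has the right shape; plenty of ordinary words (for example "abcd") also pass.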
All you need to do is decode, then re-encode. If the re-encoded string is equal to the original string, then it is valid base64. Note that this only validates the encoding; it does not tell you what kind of data the string holds.
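The decode-then-re-encode check could be sketched like this. The function name is_canonical_base64 is an assumption for illustration; validate=True makes base64.b64decode reject characters outside the alphabet instead of silently skipping them.

```python
import base64
import binascii

def is_canonical_base64(s: str) -> bool:
    """Decode, re-encode, and compare with the input (hypothetical helper)."""
    try:
        decoded = base64.b64decode(s, validate=True)
    except (binascii.Error, ValueError):
        return False
    return base64.b64encode(decoded).decode("ascii") == s

print(is_canonical_base64("aGVsbG8="))     # True
print(is_canonical_base64("hello world"))  # False
```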
To decode an image using Python, we simply use the base64.b64decode(s) function. The Python documentation says the following about this function: Decode the Base64 encoded bytes-like object or ASCII string s and return the decoded bytes.
In Python, the base64 module is used to encode and decode data. Strings are first converted into bytes-like objects and then encoded using the base64 module.
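A short round trip, as a sketch of that str-to-bytes-to-base64 flow (the sample text "hello" is just an illustration):

```python
import base64

text = "hello"
encoded = base64.b64encode(text.encode("utf-8"))  # bytes in, bytes out
decoded = base64.b64decode(encoded)               # back to the original bytes

print(encoded)                  # b'aGVsbG8='
print(decoded.decode("utf-8"))  # hello
```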
You can't, at least not without decoding, because the bytes that help identify the filetype are spread across the base64 characters, which don't directly align with whole bytes. Each character encodes 6 bits, which means that for every 4 characters, there are 3 bytes encoded.
Identifying a filetype requires access to those bytes in different block sizes. A JPEG image, for example, can be identified by its first two bytes, FF D8 (and it ends with FF D9), but that's only two bytes; the third byte that follows is also encoded as part of the same 4-character block.
What you can do is decode just enough of the base64 string to do your filetype fingerprinting. So you can decode the first 4 characters to get the 3 bytes, and then use the first two to see if the object is a JPEG image. A large number of file formats can be identified from just the first or last series of bytes (a PNG image can be identified by the first 8 bytes, a GIF by the first 6, etc.). Decoding just those bytes from the base64 string is trivial.
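Here is a rough sketch of that kind of fingerprinting using only the base64 module. The signature table is deliberately tiny and the helper name sniff_base64 is made up; it assumes the input is either at least 12 characters long or a whole number of 4-character groups.

```python
import base64

# Leading "magic" bytes for a few formats (not an exhaustive list).
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"\xff\xd8": "jpeg",
}

def sniff_base64(encoded: str):
    """Guess a file type from the first few base64 characters (hypothetical helper)."""
    # 12 base64 chars decode to 9 bytes, enough for the 8-byte PNG signature.
    head = base64.b64decode(encoded[:12])
    for magic, name in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return None

image_data = "iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg=="
print(sniff_base64(image_data))  # png
```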
Your sample is a PNG image; you can test for image types using the imghdr module (note that imghdr was deprecated in Python 3.11 and removed in Python 3.13):
>>> import base64
>>> import imghdr
>>> image_data = """iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg=="""
>>> sample = base64.b64decode(image_data[:44])  # 33 bytes / 3 times 4 is 44 base64 chars
>>> for tf in imghdr.tests:
...     res = tf(sample, None)
...     if res:
...         break
...
>>> print(res)
png
I only used the first 33 bytes from the base64 data, to echo what the imghdr.what() function will read from the file you pass it (it reads 32 bytes, but that number isn't divisible by 3).
There is an equivalent sndhdr module for sound files, and there is also the python-magic project, which lets you pass in a number of bytes to determine a file type.
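The 32-versus-33 arithmetic generalizes: round the byte count up to a multiple of 3, then take 4 characters per 3 bytes. A tiny sketch (chars_for_bytes is a made-up name):

```python
import math

def chars_for_bytes(n_bytes: int) -> int:
    """Base64 characters to slice so that at least n_bytes decode cleanly."""
    return 4 * math.ceil(n_bytes / 3)

print(chars_for_bytes(32))  # 44, which decodes to 33 bytes
print(chars_for_bytes(8))   # 12, which decodes to 9 bytes
```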
Of course you can. There are a few fairly easy approaches to the problem that I can think of:
Each base64 character encodes 6 bits of input, so you can relate them as follows:
Base64: AAAAAABBBBBBCCCCCCDDDDDDEEEEEEFFFFFFGGGGGGHHHHHH
Data:   xxxxxxxxyyyyyyyyzzzzzzzzqqqqqqqqwwwwwwwweeeeeeee
If you would like to extract 4 bytes of data, starting with offset 1, like this:
                ................................
Base64: AAAAAABBBBBBCCCCCCDDDDDDEEEEEEFFFFFFGGGGGGHHHHHH
Data:   xxxxxxxxyyyyyyyyzzzzzzzzqqqqqqqqwwwwwwwweeeeeeee
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Then, to decode only the parts that you want, you need to know the bit distances. They are easy to calculate: just multiply your byte distances by 8. Now that you know you want 32 bits, starting with bit 8, you can find which base64 characters contain your starting and ending bits. To do that, divmod your offset and offset+length by 6:
start = bit 8 = char 1 + bit 2
end = bit 40 = char 6 + bit 4
This maps onto the scheme above: your span starts after 1 full base64 char and 2 bits, and ends after 6 full base64 chars and 4 bits.
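The same calculation in Python, as a quick sketch (variable names are only illustrative):

```python
byte_offset, byte_length = 1, 4

start_bit = byte_offset * 8                  # 8
end_bit = (byte_offset + byte_length) * 8    # 40

# Quotient: number of full base64 chars before the bit; remainder: bits into the next char.
start_char, start_rem = divmod(start_bit, 6)
end_char, end_rem = divmod(end_bit, 6)

print(start_char, start_rem)  # 1 2
print(end_char, end_rem)      # 6 4
```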
Now that you know exactly which base64 chars you want, you need to decode them. It makes sense to leverage an existing base64 decoder so that you don't have to implement base64 decoding yourself. To do that, recall that every 4 chars of base64 correspond to 3 bytes of data. So here is the trick: you can prepend and append arbitrary (but valid) base64 characters to your extracted base64 code until the base64 and byte boundaries align, and then, knowing how many bogus bytes the decoder will produce, throw out the excess.
So, how much to prepend depends on the value of the bit remainder. If the start bit remainder is 0, it means that A and x are aligned, so no changes are required:
           |==========================
Base64: ...AAAAAABBBBBBCCCCCCDDDDDD...
Data:   ...xxxxxxxxyyyyyyyyzzzzzzzz...
           |==========================
If bit remainder is 2, you need to prepend one base64 char, and throw out one leading byte after decoding:
                 ##|==================
Base64: ...AAAAAABBBBBBCCCCCCDDDDDD...
Data:   ...xxxxxxxxyyyyyyyyzzzzzzzz...
                   |==================
If bit remainder is 4, you need to prepend two base64 chars, and throw out two leading bytes after decoding:
                       ####|==========
Base64: ...AAAAAABBBBBBCCCCCCDDDDDD...
Data:   ...xxxxxxxxyyyyyyyyzzzzzzzz...
                           |==========
Same goes for trailing. If end bit remainder is zero, no changes:
        ===|
Base64: ...AAAAAABBBBBBCCCCCCDDDDDD...
Data:   ...xxxxxxxxyyyyyyyyzzzzzzzz...
        ===|
If end bit remainder is 2, you need to append two base64 chars, and throw out two trailing bytes:
        =========##|
Base64: ...AAAAAABBBBBBCCCCCCDDDDDD...
Data:   ...xxxxxxxxyyyyyyyyzzzzzzzz...
        ===========|
If end bit remainder is 4, you need to append one base64 char, and throw out one trailing byte:
        ===============####|
Base64: ...AAAAAABBBBBBCCCCCCDDDDDD...
Data:   ...xxxxxxxxyyyyyyyyzzzzzzzz...
        ===================|
So, for the synthetic example above, one character needs to be prepended (in place of A), and one character appended (in place of H):
                ................................
Base64: ??????BBBBBBCCCCCCDDDDDDEEEEEEFFFFFFGGGGGG??????
Data:   ????????yyyyyyyyzzzzzzzzqqqqqqqqwwwwwwww????????
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Now, after decoding, throw out extra bytes from head and tail and you're done.
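Putting the prepend/append rules together, a minimal sketch could look like the following. The function name extract_bytes and the choice of "A" as filler are assumptions for illustration; it also assumes the requested byte range lies entirely inside the encoded data.

```python
import base64

def extract_bytes(encoded: str, offset: int, length: int, filler: str = "A") -> bytes:
    """Decode only bytes [offset, offset + length) of base64 text (hypothetical helper)."""
    start_char, start_rem = divmod(offset * 8, 6)
    end_char, end_rem = divmod((offset + length) * 8, 6)

    # Every base64 char that carries at least one wanted bit.
    chunk = encoded[start_char:end_char + (1 if end_rem else 0)]

    # Start remainder 0/2/4 -> prepend 0/1/2 chars and drop as many leading bytes.
    lead = {0: 0, 2: 1, 4: 2}[start_rem]
    # End remainder 0/2/4 -> append 0/2/1 chars and drop as many trailing bytes.
    trail = {0: 0, 2: 2, 4: 1}[end_rem]

    decoded = base64.b64decode(filler * lead + chunk + filler * trail)
    return decoded[lead:len(decoded) - trail]

encoded = base64.b64encode(b"xxyyzzqqwwee").decode("ascii")
print(extract_bytes(encoded, 1, 4))  # b'xyyz'
```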
Imagine you have a magic like ?PNG\r\n??????IHDR. Then, to check whether a base64-encoded string matches your magic, identify the bytes in the magic that are known, along with their bit offsets and lengths:
"PNG\r\n" -> offset = 8, length = 40
"IHDR" -> offset = 96, length = 32
So, using our ideas from above:
"PNG\r\n" -> start = 8 ( char 1, bits = 2 ), end = 48 ( char 8, bits = 0 )
"IHDR" -> start = 96 ( char 16, bits = 0 ), end = 128 ( char 21, bits = 2 )
To decode the "PNG\r\n" part, you need to take 7 full base64 chars, starting with char 1, then prepend 1 char, decode, throw out 1 leading byte and compare.
To decode the "IHDR" part, you need to take 6 base64 chars, starting with char 16, then append 2 chars, decode, throw out 2 trailing bytes and compare.
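Applied to the sample string from the question, those two steps can be sketched as follows ("A" is an arbitrary filler character; any valid base64 character would do):

```python
import base64

image_data = "iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg=="

# "PNG\r\n": 7 chars starting at char 1, one filler prepended, one leading byte dropped.
png_part = base64.b64decode("A" + image_data[1:8])[1:]

# "IHDR": 6 chars starting at char 16, two fillers appended, two trailing bytes dropped.
ihdr_part = base64.b64decode(image_data[16:22] + "AA")[:-2]

print(png_part == b"PNG\r\n")  # True
print(ihdr_part == b"IHDR")    # True
```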
An alternative to the approach described above is to translate the magics themselves instead of translating the data.
So, if you have the magic ?PNG\r\n??????IHDR, as in the example above, then encoded to base64 it looks like this (\r and \n are shown as r and n for presentation purposes):
Data:   [?PN]  [Grn]  [???]  [???]  [IHD]  [R??]
Base64: (?~BO) (Rw0K) (????) (????) (SUhE) (Ug==)
In the ?~BO part, the ~ sign is only partially random. Let's look at that construct bitwise:
Data:   ????????PPPPPPPPNNNNNNNN
Base64: ??????~~~~~~BBBBBBOOOOOO
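So only the two upper bits of ~ (which come from the two lowest bits of the unknown first byte) are truly unknown; its four lower bits are the top four bits of P. That means you can use this information when testing the magic against the data, to narrow the scope of the magic.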
For this particular case, here is the exhaustive list of all encodings:
Data:   ??????00PPPPPPPPNNNNNNNN
Base64: ??????FFFFFFBBBBBBOOOOOO => ?FBO
Data:   ??????01PPPPPPPPNNNNNNNN
Base64: ??????VVVVVVBBBBBBOOOOOO => ?VBO
Data:   ??????10PPPPPPPPNNNNNNNN
Base64: ??????llllllBBBBBBOOOOOO => ?lBO
Data:   ??????11PPPPPPPPNNNNNNNN
Base64: ??????111111BBBBBBOOOOOO => ?1BO
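The four candidates can also be derived with a little bit arithmetic. This sketch combines each possible pair of unknown bits with the known top four bits of P:

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

high_nibble_of_p = ord("P") >> 4  # 0b0101, the known low four bits of the second char

# The two unknown bits sit above the known nibble inside the 6-bit value.
candidates = [ALPHABET[(unknown << 4) | high_nibble_of_p] for unknown in range(4)]
print(candidates)  # ['F', 'V', 'l', '1']
```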
The same applies to the trailing R?? group, but because there are 4 undefined bits instead of 2, the permutation list is longer:
Ug?? <= 0000???? ????????
Uh?? <= 0001???? ????????
Ui?? <= 0010???? ????????
Uj?? <= 0011???? ????????
Uk?? <= 0100???? ????????
Ul?? <= 0101???? ????????
Um?? <= 0110???? ????????
Un?? <= 0111???? ????????
Uo?? <= 1000???? ????????
Up?? <= 1001???? ????????
Uq?? <= 1010???? ????????
Ur?? <= 1011???? ????????
Us?? <= 1100???? ????????
Ut?? <= 1101???? ????????
Uu?? <= 1110???? ????????
Uv?? <= 1111???? ????????
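If you would rather not work this list out by hand, a brute-force sketch can confirm it: encode R followed by every possible next byte and collect the distinct two-character prefixes.

```python
import base64

# Only the top four bits of the byte after "R" influence the second base64 char.
variants = sorted(
    {base64.b64encode(b"R" + bytes([nxt]))[:2].decode("ascii") for nxt in range(256)}
)

print(len(variants))              # 16
print(variants[0], variants[-1])  # Ug Uv
```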
So, as a regexp, your base64 magic for ?PNG\r\n??????IHDR would look like this:
import base64
import re

rx = re.compile(b'^.[FVl1]BORw0K........SUhEU[g-v]')
if rx.match(base64.b64encode(b'xPNG\r\n123456IHDR789foobar')):
    print('Yep, it works!')
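The hand-written pattern can also be generated. Below is a rough sketch that walks the magic six bits at a time and enumerates whatever the unknown bits allow; the function name magic_to_regex and the convention of using None for an unknown byte are assumptions made for this example.

```python
import base64
import re
from itertools import product

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def magic_to_regex(magic):
    """Build a regex for the base64 form of magic (ints = known bytes, None = unknown)."""
    n_bits = len(magic) * 8
    n_chars = -(-n_bits // 6)  # ceiling division
    parts = []
    for c in range(n_chars):
        # Collect the 6 bits behind this character; None marks an unknown bit.
        bits = []
        for b in range(6 * c, 6 * c + 6):
            byte = magic[b // 8] if b < n_bits else None
            bits.append(None if byte is None else (byte >> (7 - b % 8)) & 1)
        # Enumerate every 6-bit value the unknown bits allow.
        choices = [(0, 1) if bit is None else (bit,) for bit in bits]
        values = sorted({int("".join(map(str, combo)), 2) for combo in product(*choices)})
        if len(values) == 64:
            parts.append(".")
        elif len(values) == 1:
            parts.append(re.escape(ALPHABET[values[0]]))
        else:
            parts.append("[" + "".join(re.escape(ALPHABET[v]) for v in values) + "]")
    return "^" + "".join(parts)

magic = [None, *b"PNG\r\n", *([None] * 6), *b"IHDR"]
pattern = magic_to_regex(magic)
print(pattern)  # ^.[FVl1]BORw0K........SUhEU[ghijklmnopqrstuv]

encoded = base64.b64encode(b"xPNG\r\n123456IHDR789foobar").decode("ascii")
print(bool(re.match(pattern, encoded)))  # True
```

The generated character class spells out g through v instead of using a range, but it matches the same strings as the hand-written pattern.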