Base64 encoding schemes are commonly used when there is a need to encode binary data that needs to be stored and transferred over media that are designed to deal with ASCII. This is to ensure that the data remain intact without modification during transport.
If we were to Base64 encode a string we would follow these steps: Take the ASCII value of each character in the string. Calculate the 8-bit binary equivalent of the ASCII values. Convert the 8-bit chunks into chunks of 6 bits by simply re-grouping the digits. Convert each 6-bit chunk to its decimal value and look up the corresponding character in the Base64 table.
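As a rough sketch of those steps in Python (manual_b64encode and ALPHABET are hypothetical names made up for this illustration, and the input is assumed to be plain ASCII), something like this should give the same result as the standard library:

```python
import base64
import string

# The standard Base64 alphabet: A-Z, a-z, 0-9, '+', '/'
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def manual_b64encode(text):
    """Hypothetical helper: Base64-encode an ASCII string by hand."""
    # Take the ASCII value of each character and write it as 8 bits
    bits = "".join(format(ord(ch), "08b") for ch in text)
    # Pad with zero bits so the total length is a multiple of 6
    bits += "0" * (-len(bits) % 6)
    # Regroup into 6-bit chunks and look each one up in the alphabet
    chars = [ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)]
    encoded = "".join(chars)
    # Pad with '=' so the output length is a multiple of 4
    encoded += "=" * (-len(encoded) % 4)
    return encoded


result = manual_b64encode("test")
print(result)  # dGVzdA==
print(result == base64.b64encode(b"test").decode("ascii"))  # True
```

In real code you would just call base64.b64encode(); the hand-rolled version is only there to make the regrouping visible.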
From Wikipedia: The final '==' sequence indicates that the last group contained only one byte, and '=' indicates that it contained two bytes.
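A quick way to see the padding rule for yourself is to encode inputs of one, two and three bytes (the sample inputs here are arbitrary):

```python
import base64

# One leftover byte gives '==', two give '=', a full group of three gives none
one_byte = base64.b64encode(b"a")
two_bytes = base64.b64encode(b"ab")
three_bytes = base64.b64encode(b"abc")

print(one_byte)     # b'YQ=='
print(two_bytes)    # b'YWI='
print(three_bytes)  # b'YWJj'
```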
No, you would need to URL-encode it, since base64 strings can contain the "+", "=" and "/" characters, which could alter the meaning of your data (a "/" can look like a sub-folder, for example).
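For illustration, here is a small sketch of two ways to get a URL-friendly value in Python: percent-encoding the standard output with urllib.parse.quote(), or using the URL-safe alphabet from base64.urlsafe_b64encode(). The two input bytes are an arbitrary choice that happens to produce both "+" and "/":

```python
import base64
from urllib.parse import quote

raw = b"\xfb\xff"  # arbitrary bytes chosen so the output contains '+' and '/'

standard = base64.b64encode(raw).decode("ascii")
print(standard)  # +/8=

# Option 1: percent-encode every reserved character
percent_encoded = quote(standard, safe="")
print(percent_encoded)  # %2B%2F8%3D

# Option 2: the URL-safe variant swaps '+' for '-' and '/' for '_'
url_safe = base64.urlsafe_b64encode(raw).decode("ascii")
print(url_safe)  # -_8=
```

Note that the URL-safe variant still keeps the "=" padding, so whether that is acceptable depends on where in the URL the value ends up.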
base64 encoding takes 8-bit binary byte data and encodes it using only the characters A-Z, a-z, 0-9, + and / (*), so it can be transmitted over channels that do not preserve all 8 bits of data, such as email.
Hence, it wants a string of 8-bit bytes. You create those in Python 3 with the b'' syntax.
If you remove the b, it becomes a string. A string is a sequence of Unicode characters. base64 has no idea what to do with Unicode data, it's not 8-bit. It's not really any bits, in fact. :-)
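To make the difference concrete, here is a small sketch of what happens with and without the b prefix (the exact wording of the error message may differ between Python versions):

```python
import base64

# Passing a str (no b prefix) is rejected in Python 3
try:
    base64.b64encode("data to be encoded")
except TypeError as exc:
    error = exc
    print("TypeError:", exc)

# Passing bytes works
encoded = base64.b64encode(b"data to be encoded")
print(encoded)  # b'ZGF0YSB0byBiZSBlbmNvZGVk'
```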
In your second example:
>>> encoded = base64.b64encode('data to be encoded')
All the characters fit neatly into the ASCII character set, and base64 encoding is therefore actually a bit pointless. You can convert it to ascii instead, with
>>> encoded = 'data to be encoded'.encode('ascii')
Or simpler:
>>> encoded = b'data to be encoded'
Which would be the same thing in this case.
* Most base64 flavours may also include a = at the end as padding. In addition, some base64 variants may use characters other than + and /. See the Variants summary table at Wikipedia for an overview.
You need to pass a bytes-like object (bytes, bytearray, etc.) to the base64.b64encode() function. Here are two ways:
>>> import base64
>>> data = base64.b64encode(b'data to be encoded')
>>> print(data)
b'ZGF0YSB0byBiZSBlbmNvZGVk'
Or with a variable:
>>> import base64
>>> string = 'data to be encoded'
>>> data = base64.b64encode(string.encode())
>>> print(data)
b'ZGF0YSB0byBiZSBlbmNvZGVk'
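As a further sketch, other bytes-like objects such as bytearray and memoryview should be accepted as well, and the result is always bytes, so you can decode it if you want a str back (the variable names are just for illustration):

```python
import base64

payload = b"data to be encoded"

# bytes, bytearray and memoryview are all bytes-like objects
from_bytes = base64.b64encode(payload)
from_bytearray = base64.b64encode(bytearray(payload))
from_memoryview = base64.b64encode(memoryview(payload))
print(from_bytes == from_bytearray == from_memoryview)  # True

# b64encode() returns bytes; decode it to get a str without the b prefix
as_text = from_bytes.decode("ascii")
print(as_text)  # ZGF0YSB0byBiZSBlbmNvZGVk
```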
In Python 3, str objects are not C-style character arrays (so they are not byte arrays), but rather, they are data structures that do not have any inherent encoding. You can encode that string (or interpret it) in a variety of ways. The most common (and default in Python 3) is utf-8, especially since it is backwards compatible with ASCII (as are most widely used encodings). That is what is happening when you take a string and call the .encode() method on it: Python encodes the string as utf-8 (the default encoding) and gives you the bytes that it corresponds to.
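A small sketch of that point: the same one-character string gives different bytes depending on the encoding you pick, while pure ASCII text comes out the same either way (the sample characters are arbitrary):

```python
text = "ñ"

utf8_bytes = text.encode()             # no argument means utf-8
latin1_bytes = text.encode("latin-1")  # a different encoding, different bytes

print(utf8_bytes)    # b'\xc3\xb1'
print(latin1_bytes)  # b'\xf1'

# For pure ASCII text, utf-8 and ascii produce identical bytes
same_for_ascii = "abc".encode() == "abc".encode("ascii")
print(same_for_ascii)  # True
```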
Originally the question title asked about Base-64 encoding. Read on for Base-64 stuff.
base64 encoding takes 6-bit binary chunks and encodes them using the characters A-Z, a-z, 0-9, '+', '/', and '=' (some encodings use different characters in place of '+' and '/'). This is a character encoding that is based on the mathematical construct of the radix-64 or base-64 number system, but the two are very different. Base-64 in math is a number system like binary or decimal, and you do this change of radix on the entire number, or (if the radix you're converting from is a power of 2 less than 64) in chunks from right to left.
In base64 encoding, the translation is done from left to right; those first 64 characters are why it is called base64 encoding. The 65th '=' symbol is used for padding, since the encoding pulls 6-bit chunks but the data it is usually meant to encode are 8-bit bytes, so sometimes there are only two or four bits in the last chunk.
Example:
>>> data = b'test'
>>> for byte in data:
... print(format(byte, '08b'), end=" ")
...
01110100 01100101 01110011 01110100
>>>
If you interpret that binary data as a single integer, then this is how you would convert it to base-10 and base-64 (table for base-64):
base-2:  01 110100 011001 010111 001101 110100 (base-64 grouping shown)
base-10: 1952805748
base-64:  B      0      Z      X      N      0
base64 encoding, however, will re-group this data thusly:
base-2:  011101 000110 010101 110011 011101 00(0000) <- pad w/zeros to make a clean 6-bit chunk
base-10:     29      6     21     51     29  0
base-64:      d      G      V      z      d  A
So, 'B0ZXN0' is the base-64 version of our binary, mathematically speaking. However, base64 encoding has to do the encoding in the opposite direction (so the raw data is converted to 'dGVzdA') and also has a rule to tell other applications how much space is left off at the end. This is done by padding the end with '=' symbols. So, the base64 encoding of this data is 'dGVzdA==', with two '=' symbols to signify two pairs of bits will need to be removed from the end when this data gets decoded to make it match the original data.
Let's test this to see if I am being dishonest:
>>> import base64
>>> encoded = base64.b64encode(data)
>>> print(encoded)
b'dGVzdA=='
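If you want to check the mathematical side as well, here is a rough sketch that treats the bytes as one big integer and writes it in radix 64 (radix64 and ALPHABET are hypothetical names made up for this comparison, not part of the base64 module):

```python
import base64
import string

# Same 64 digits as the base64 table: A-Z, a-z, 0-9, '+', '/'
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def radix64(number):
    """Hypothetical helper: write a non-negative integer in base 64."""
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        # Peel off the lowest 6 bits each time, i.e. work right to left
        number, remainder = divmod(number, 64)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


data = b"test"
as_integer = int.from_bytes(data, "big")

print(as_integer)              # 1952805748
print(radix64(as_integer))     # B0ZXN0
print(base64.b64encode(data))  # b'dGVzdA=='
```

The two results differ because the number-system conversion groups bits from the right, while base64 encoding groups them from the left and pads the end.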
Why use base64 encoding?
Let's say I have to send some data to someone via email, like this data:
>>> data = b'\x04\x6d\x73\x67\x08\x08\x08\x20\x20\x20'
>>> print(data.decode())
>>> print(data)
b'\x04msg\x08\x08\x08 '
>>>
There are two problems I planted:
First, the transmission could end as soon as the \x04 character was read, because that is ASCII for END-OF-TRANSMISSION (Ctrl-D), so the remaining data would be left out of the transmission.
Second, I used three BACKSPACE characters and three SPACE characters to erase the 'msg'. Thus, even if I didn't have the EOT character there, the end user wouldn't be able to translate from the text on screen to the real, raw data.
This is just a demo to show you how hard it can be to simply send raw data. Encoding the data into base64 format gives you the exact same data but in a format that ensures it is safe for sending over electronic media such as email.
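To round this off, a short sketch of what the sender and the receiver would do with that same data: the encoded form contains only printable characters, and decoding gives back the original bytes, control characters included.

```python
import base64

data = b"\x04\x6d\x73\x67\x08\x08\x08\x20\x20\x20"

# Sender: turn the raw bytes into plain printable text
encoded = base64.b64encode(data)
print(encoded)
print(encoded.decode("ascii").isprintable())  # True

# Receiver: decode back to the exact original bytes
decoded = base64.b64decode(encoded)
print(decoded == data)  # True
```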
If the data to be encoded contains "exotic" characters, I think you have to encode it as "UTF-8" first:
import base64
encoded = base64.b64encode(bytes('data to be encoded', "utf-8"))
If the string is Unicode the easiest way is:
import base64
a = base64.b64encode(bytes(u'complex string: ñáéíóúÑ', "utf-8"))
# a: b'Y29tcGxleCBzdHJpbmc6IMOxw6HDqcOtw7PDusOR'
b = base64.b64decode(a).decode("utf-8", "ignore")
print(b)
# b: complex string: ñáéíóúÑ