I have a unicode string retrieved from a webservice using the requests module, which contains the bytes of a binary document (PCL, as it happens). One of these bytes has the value 248, and attempting to base64 encode it leads to the following error:
In [68]: base64.b64encode(response_dict['content']+'\n')
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
C:\...\<ipython-input-68-8c1f1913eb52> in <module>()
----> 1 base64.b64encode(response_dict['content']+'\n')
C:\Python27\Lib\base64.pyc in b64encode(s, altchars)
51 """
52 # Strip off the trailing newline
---> 53 encoded = binascii.b2a_base64(s)[:-1]
54 if altchars is not None:
55 return _translate(encoded, {'+': altchars[0], '/': altchars[1]})
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 272: ordinal not in range(128)
In [69]: response_dict['content'].encode('base64')
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
C:\...\<ipython-input-69-7fd349f35f04> in <module>()
----> 1 response_dict['content'].encode('base64')
C:\...\base64_codec.pyc in base64_encode(input, errors)
22 """
23 assert errors == 'strict'
---> 24 output = base64.encodestring(input)
25 return (output, len(input))
26
C:\Python27\Lib\base64.pyc in encodestring(s)
313 for i in range(0, len(s), MAXBINSIZE):
314 chunk = s[i : i + MAXBINSIZE]
--> 315 pieces.append(binascii.b2a_base64(chunk))
316 return "".join(pieces)
317
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 44: ordinal not in range(128)
I find this slightly surprising, because 248 is within the range of an unsigned byte (and can be held in a byte string), but my real question is: what is the best or right way to encode this string?
My current work-around is this:
In [74]: byte_string = ''.join(map(compose(chr, ord), response_dict['content']))
In [75]: byte_string[272]
Out[75]: '\xf8'
This appears to work correctly, and the resulting byte_string is capable of being base64 encoded, but it seems like there should be a better way. Is there?
To Base64 encode a string, the following steps are followed: take the byte value of each character (for plain ASCII text this is its ASCII value), write each value as 8 binary digits, and regroup the resulting bit stream into chunks of 6 bits. Each 6-bit chunk is then mapped to one character of the 64-character Base64 alphabet, with "=" padding added when the input length is not a multiple of 3 bytes.
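The regrouping described above can be sketched by hand as follows. This is only an illustration of the steps, not how the base64 module is implemented; the function name b64encode_by_hand is made up for this example, and it assumes the standard alphabet with "=" padding.

```python
import base64

# Standard Base64 alphabet: index 0-63 maps to one output character.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def b64encode_by_hand(data):
    """Illustrative Base64 encoder for a byte string (hypothetical helper)."""
    data = bytearray(data)
    # Step 1 and 2: write every byte as 8 binary digits.
    bits = "".join(format(byte, "08b") for byte in data)
    # Pad the bit stream with zeros so it splits evenly into 6-bit chunks.
    bits += "0" * (-len(bits) % 6)
    # Step 3: regroup into 6-bit chunks and look each one up in the alphabet.
    chars = [ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)]
    # Pad the output with '=' to a multiple of 4 characters.
    chars.append("=" * (-len(chars) % 4))
    return "".join(chars)


sample = b"\xf8"  # the byte value 248 from the question
by_hand = b64encode_by_hand(sample)
by_module = base64.b64encode(sample)
```

Note that the input here is a byte string: the steps only make sense once every character has a definite value in the range 0 to 255.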
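In Python 2, the unicode type stores an abstract sequence of code points, each of which represents a character or part of one. By contrast, a byte str stores a sequence of bytes, which map to a sequence of code points only once an encoding is chosen. Base64 operates on bytes, so base64.b64encode first converts a unicode argument implicitly with the default ascii codec, and that is what fails on u'\xf8' even though 248 fits in a byte.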
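If every code point in the string is below 256, as it should be when the string really holds one byte per character, then encoding with latin-1 gives the same result as the chr/ord work-around, because latin-1 maps code points 0 to 255 straight onto byte values 0 to 255. A minimal sketch, where the sample content is invented and stands in for response_dict['content']:

```python
import base64

# Hypothetical stand-in for response_dict['content'].
content = u'\x1bE binary \xf8 payload'

# latin-1 maps each code point 0-255 onto the byte with the same value,
# so this is equivalent to applying chr(ord(c)) to every character.
byte_string = content.encode('latin-1')

# For comparison: utf-8 turns U+00F8 into two bytes, changing the data.
utf8_bytes = content.encode('utf-8')

# Now that we have bytes, Base64 encoding works and round-trips.
encoded = base64.b64encode(byte_string)
decoded = base64.b64decode(encoded)
```

If the string contains any code point above 255, the latin-1 encode raises UnicodeEncodeError, which is a useful signal that the data was not simply bytes widened to unicode.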
Since you are working with binary data, I'm not sure that it's a good idea to use the utf-8 encoding: it would turn u'\xf8' into the two bytes 0xC3 0xB8, so the encoded document would no longer match the original. I guess it depends on how you intend to use the base64 encoded representation. I think it would probably be better to retrieve the data as a byte string rather than a unicode string in the first place. I have never used the requests library, but browsing the documentation suggests that it is possible. There are sections talking about "Binary Response Content" and "Raw Response Content".
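As a rough sketch of that idea: a requests response exposes the undecoded body as response.content (bytes), alongside response.text (unicode). This assumes the document is the response body itself rather than a field inside a JSON payload, in which case it would not apply. The helper name and the FakeResponse class below are made up so the example runs without a network call.

```python
import base64


def body_as_base64(response):
    """Base64-encode the raw body of a requests-style response object."""
    # response.content is the undecoded body as bytes; response.text would be
    # the same body decoded to unicode with a guessed encoding.
    return base64.b64encode(response.content)


class FakeResponse(object):
    """Hypothetical stand-in for requests.Response so the sketch runs offline."""

    def __init__(self, content):
        self.content = content


# With the real library this would be something like:
#     response = requests.get(url)   # url is whatever the web service exposes
response = FakeResponse(b'\x1bE\xf8')
encoded = body_as_base64(response)
```

Working with the bytes from the start avoids the unicode round trip entirely, so no encoding has to be guessed.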