Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

base64 encoding unicode strings in python 2.7

I have a unicode string retrieved from a webservice using the requests module, which contains the bytes of a binary document (PCL, as it happens). One of these bytes has the value 248, and attempting to base64 encode it leads to the following error:

In [68]: base64.b64encode(response_dict['content']+'\n')
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
C:\...\<ipython-input-68-8c1f1913eb52> in <module>()
----> 1 base64.b64encode(response_dict['content']+'\n')

C:\Python27\Lib\base64.pyc in b64encode(s, altchars)
     51     """
     52     # Strip off the trailing newline
---> 53     encoded = binascii.b2a_base64(s)[:-1]
     54     if altchars is not None:
     55         return _translate(encoded, {'+': altchars[0], '/': altchars[1]})

UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 272: ordinal not in range(128)

In [69]: response_dict['content'].encode('base64')
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
C:\...\<ipython-input-69-7fd349f35f04> in <module>()
----> 1 response_dict['content'].encode('base64')

C:\...\base64_codec.pyc in base64_encode(input, errors)
     22     """
     23     assert errors == 'strict'
---> 24     output = base64.encodestring(input)
     25     return (output, len(input))
     26

C:\Python27\Lib\base64.pyc in encodestring(s)
    313     for i in range(0, len(s), MAXBINSIZE):
    314         chunk = s[i : i + MAXBINSIZE]
--> 315         pieces.append(binascii.b2a_base64(chunk))
    316     return "".join(pieces)
    317

UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 44: ordinal not in range(128)

I find this slightly surprising, because 248 is within the range of an unsigned byte (and can be held in a byte string), but my real question is: what is the best or right way to encode this string?

My current work-around is this:

In [74]: byte_string = ''.join(map(compose(chr, ord), response_dict['content']))

In [75]: byte_string[272]
Out[75]: '\xf8'

This appears to work correctly, and the resulting byte_string is capable of being base64 encoded, but it seems like there should be a better way. Is there?

like image 267
Marcin Avatar asked Mar 05 '12 18:03

Marcin


People also ask

How do I encode strings to base64 in Python?

To convert a string into a Base64 character the following steps should be followed: Get the ASCII value of each character in the string. Compute the 8-bit binary equivalent of the ASCII values. Convert the 8-bit characters chunk into chunks of 6 bits by re-grouping the digits.

How do I get base64 encoded strings?

If we were to Base64 encode a string we would follow these steps: Take the ASCII value of each character in the string. Calculate the 8-bit binary equivalent of the ASCII values. Convert the 8-bit chunks into chunks of 6 bits by simply re-grouping the digits.

Does base64 support Unicode?

Anything that you paste or enter in the text area on the left automatically gets encoded to base64 on the right. It supports the most popular Unicode encodings (such as UTF-8, UTF-16, UCS-2, UTF-32, and UCS-4) and it works with emoji characters. You can also adjust the output base64 line length.

What is Unicode in Python 2?

In python, the unicode type stores an abstract sequence of code points. Each code point represents a grapheme. By contrast, byte str stores a sequence of bytes which can then be mapped to a sequence of code points.


1 Answers

Since you are working with binary data, I'm not sure that it's a good idea to use the utf-8 encoding. I guess it depends on how you intend to use the base64 encoded representation. I think it would probably be better if you can retrieve the data as a bytes string and not a unicode string. I have never used the requests library, but browsing the documentation suggests that it is possible. There are sections talking about "Binary Response Content" and "Raw Response Content".

like image 108
Dan Gerhardsson Avatar answered Oct 05 '22 04:10

Dan Gerhardsson