I've got a Python program that stores and writes data to a file. The data is raw binary data, stored internally as `str`. I'm writing it out through a UTF-8 codec. However, I get `UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 25: character maps to <undefined>` in the cp1252.py file.

This looks to me like Python is trying to interpret the data using the default code page. But this data doesn't have a code page; that's why I'm using `str`, not `unicode`.
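Roughly, the writing side looks like the sketch below (the names and the `codecs.open` call are just illustrative of "writing through a utf-8 codec"; which codec the error names depends on the system default encoding):

```python
import codecs

raw = 'some header bytes: \x8d\x00\xff'   # raw binary data kept in a str

out = codecs.open('data.bin', 'w', encoding='utf-8')
out.write(raw)   # fails: the str is first decoded with the default encoding
out.close()
```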
I guess my question is: how should I be writing this raw binary data out to the file?
NOTE: this was written for Python 2.x. Not sure if applicable to 3.x.
Your use of `str` for raw binary data in memory is correct. [If you're using Python 2.6+, it's even better to use `bytes`, which in 2.6+ is just an alias for `str` but expresses your intention better, and will help if one day you port the code to Python 3.]
As others note, writing binary data through a codec is strange. A write codec takes unicode and outputs bytes into the file. You're trying to do it backwards, hence our confusion about your intentions...
[And your diagnosis of the error looks correct: since the codec expects unicode, Python is decoding your str into unicode with the system's default encoding, which chokes.]
What do you want to see in the output file?
If the file should contain the binary data as-is:
Then you must not send it through a codec; you must write it directly to the file. A codec encodes everything and can only emit valid encodings of unicode (in your case, valid UTF-8). There is no input you can give it to make it emit arbitrary byte sequences!
If you need to mix UTF-8 text and raw binary data in the same file, open the file directly and intermix writes of `some_data` with `some_text.encode('utf8')`, as sketched below.
Note, however, that mixing UTF-8 with raw arbitrary data is very bad design, because such files are very inconvenient to deal with: tools that understand unicode will choke on the binary data, leaving you with no convenient way to even view (let alone modify) the file.
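For example, a minimal sketch of the direct, codec-free write (the variable and file names are just illustrative):

```python
some_data = '\x00\x8d\xff raw payload'   # arbitrary bytes in a str
some_text = u'a human-readable label\n'  # unicode text you might want to mix in

# Open in binary mode and write the bytes directly; no codec is involved.
with open('output.bin', 'wb') as f:
    f.write(some_data)                  # bytes go out exactly as stored
    f.write(some_text.encode('utf8'))   # encode any text yourself before writing
```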
If you want a friendly representation of arbitrary bytes in unicode:
Pass `data.encode('base64')` to the codec. Base64 produces only clean ASCII (letters, numbers, and a little punctuation), so it can be cleanly embedded in anything, it is obviously recognizable to people as encoded binary data, and it's reasonably compact (slightly over 33% overhead).
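A minimal sketch of that approach, keeping your existing UTF-8 codec in place (the file name is illustrative):

```python
import codecs

data = '\x00\x8d\xff\x10'            # arbitrary bytes in a str

out = codecs.open('output.txt', 'w', encoding='utf-8')
out.write(data.encode('base64'))     # base64 output is pure ASCII, so the codec is happy
out.close()

# Reading it back:
inp = codecs.open('output.txt', 'r', encoding='utf-8')
restored = inp.read().encode('ascii').decode('base64')
inp.close()
assert restored == data
```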
P.S. You may note that `data.encode('base64')` is strange: `.encode()` is supposed to take unicode, but I'm giving it a str?! Python has several pseudo-codecs that convert str->str, such as 'base64' and 'zlib'. Also, `.encode()` always returns a str, yet you'll feed it into a codec expecting unicode?! In this case it will only contain clean ASCII, so it doesn't matter. You may write `data.encode('base64').encode('utf8')` explicitly if it makes you feel better.
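A quick sketch of those str-to-str pseudo-codecs in action (Python 2 only):

```python
data = '\x00\x8d\xff\x10'

b64 = data.encode('base64')       # str -> str, ASCII-only (a trailing newline is added)
packed = data.encode('zlib')      # str -> str, compressed binary

assert b64.decode('base64') == data
assert packed.decode('zlib') == data

# The explicit, "feel better" form mentioned above:
utf8_ready = data.encode('base64').encode('utf8')
```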
If you need a 1:1 mapping from arbitrary bytes to unicode:
Pass `data.decode('latin1')` to the codec. `latin1` maps bytes 0-255 to unicode characters 0-255, which is kinda elegant.
The codec will, of course, encode your characters: bytes 128-255 each become 2 bytes in UTF-8 (surprisingly, for random data the average overhead is 50%, more than base64!). This quite kills the "elegance" of having a 1:1 mapping.
Note also that unicode characters 0-255 include nasty invisible/control characters (newline, formfeed, soft hyphen, etc.) making your binary data annoying to view in text editors.
Considering these drawbacks, I do not recommend latin1 unless
you understand exactly why you want it.
I'm just mentioning it as the other "natural" encoding that springs
to mind.
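For completeness, a sketch of the latin1 round trip (the file name is illustrative):

```python
import codecs

data = ''.join(chr(i) for i in range(256))   # every possible byte value

out = codecs.open('output.txt', 'w', encoding='utf-8')
out.write(data.decode('latin1'))    # 1:1 mapping: byte N -> unicode code point N
out.close()

inp = codecs.open('output.txt', 'r', encoding='utf-8')
restored = inp.read().encode('latin1')       # back to the original bytes
inp.close()
assert restored == data
```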