How to cast a string to bytes without encoding

Tags: python-3.x, character-encoding, encoding

I have a bunch of binary data that comes to Python via a char* from some C interface (not under my control), so I end up with a string holding arbitrary binary data (what would normally be a byte array). I would like to convert it to a byte array to simplify using it with other Python functions, but I can't figure out how.

Examples that don't work:

data = rawdatastr.encode()  # assumes "utf-8" and mangles the data == BAD

data = rawdatastr.encode('ascii', 'ignore')  # strips chars over 127 == BAD

data = rawdatastr.encode('latin1')  # not sure: the closest so far, but I have no proof that it works for all bytes

data = array.array('B', [x for x in map(ord, rawdatastr)]).tobytes()  # works, but seems like a lot of work to do something simple

Is there something simpler?

I am thinking I need to write my own identity encoding that just passes the bytes along (I think latin1 does this, based upon some reading, but I have no proof thus far).


asked Mar 14 '17 19:03

nickdmax

2 Answers

I suspect something else is decoding your data for you (a char* in C is usually best represented as bytes, especially if it is binary data).

That said, the latin1 codec can round-trip every byte. You can verify this with the following short program:

>>> s = ''.join(chr(i) for i in range(0x100))
>>> s
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0¡¢£¤¥¦§¨©ª«¬\xad®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'
>>> s2 = s.encode('latin1').decode('latin1')
>>> s2 == s
True
>>> sb = bytes(range(0x100))
>>> sb
b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
>>> sb == s.encode('latin1')
True
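Applied to the question, this suggests a single call should be enough, as long as every character in the string has a code point below 256. A minimal sketch, assuming a hypothetical helper name `str_to_bytes` and made-up sample data:

```python
def str_to_bytes(rawdatastr: str) -> bytes:
    """Map each code point 0..255 to the byte with the same value.

    Hypothetical helper name; it only wraps str.encode('latin1').
    """
    return rawdatastr.encode('latin1')


sample = 'abc\x00\x80\xff'  # made-up sample data
data = str_to_bytes(sample)
print(data)                                     # b'abc\x00\x80\xff'
print(list(data) == [ord(c) for c in sample])   # True

# Code points above 255 have no single-byte equivalent, so latin1 rejects them.
try:
    str_to_bytes('\u0100')
except UnicodeEncodeError as exc:
    print('cannot encode:', exc.reason)
```

If the `UnicodeEncodeError` branch is ever hit with real data, the string was probably not produced by a byte-per-character decoding, and a different approach would be needed.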


answered Sep 21 '22 23:09

Anthony Sottile

Just now I ran into the same problem. This is what I came up with:

import struct

def rawbytes(s):
    """Convert a string to raw bytes without encoding"""
    outlist = []
    for cp in s:
        num = ord(cp)
        if num <= 0xFF:
            # fits in one byte
            outlist.append(struct.pack('B', num))
        elif num <= 0xFFFF:
            # two bytes, big-endian
            outlist.append(struct.pack('>H', num))
        else:
            # above U+FFFF: three bytes, big-endian (high byte, then low 16 bits)
            b = (num & 0xFF0000) >> 16
            H = num & 0xFFFF
            outlist.append(struct.pack('>BH', b, H))
    return b''.join(outlist)

Some examples:

In [34]: rawbytes('this is a test')
Out[34]: b'this is a test'

In [35]: rawbytes('\udc80\udcdf\udcff\udcff\udcff\x7f')
Out[35]: b'\xdc\x80\xdc\xdf\xdc\xff\xdc\xff\xdc\xff\x7f'
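The lone surrogates in the second example (\udc80 through \udcff) look like what Python produces when bytes are decoded with the surrogateescape error handler, although that is only a guess about where such a string came from. If that is the case, encoding with the same codec and handler may restore the original bytes directly. A sketch, assuming the data was decoded as ASCII with errors='surrogateescape' and using made-up sample bytes:

```python
original = b'\x80\xdf\xff\x7f'  # made-up sample bytes

# Bytes >= 0x80 are not valid ASCII; surrogateescape maps each one to the
# lone surrogate U+DC00 + byte value instead of raising an error.
text = original.decode('ascii', errors='surrogateescape')
print(ascii(text))              # '\udc80\udcdf\udcff\x7f'

# Encoding with the same codec and handler reverses the mapping.
restored = text.encode('ascii', errors='surrogateescape')
print(restored == original)     # True
```

Note that this yields one byte per surrogate (b'\x80'), whereas rawbytes emits two (b'\xdc\x80'), so which result is wanted depends on how the string was created.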

answered Sep 20 '22 23:09

Roland Smith
