Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to cast a string to bytes without encoding

I have a bunch of binary data that comes to python via a char* from some C interface (not under my control) so I have a string of arbitrary binary data (what is normally a byte array). I would like to convert it to a byte array to simplify using it with other python functions but I can't seem to figure out how.

Examples that don't work:

data = rawdatastr.encode() this assumes "utf-8" and mangles the data == BAD

data = rawdatastr.encode('ascii','ignore') strips chars over 127 == BAD

data = rawdatastr.encode('latin1') not sure -- this is the closest so far but I have no proof that it is working for all bytes.

data = array.array('B', [x for x in map(ord,data)]).tobytes() This works but seems like a lot of work to do something simple. Is there something simpler?

I am thinking I need to write my own identity encoding that just passes the bytes along (I think latin1 does this based upon some reading but no proof thus far).

like image 472
nickdmax Avatar asked Mar 14 '17 19:03

nickdmax


People also ask

Is byte [] same as string?

Byte objects are sequence of Bytes, whereas Strings are sequence of characters. Byte objects are in machine readable form internally, Strings are only in human readable form. Since Byte objects are machine readable, they can be directly stored on the disk.

How do I convert string to bit in Python?

To convert a string to binary, we first append the string's individual ASCII values to a list ( l ) using the ord(_string) function. This function gives the ASCII value of the string (i.e., ord(H) = 72 , ord(e) = 101). Then, from the list of ASCII values we can convert them to binary using bin(_integer) .


2 Answers

Though I suspect something else is decoding your data for you (a char* in C is usually best represented as bytes, especially if it is binary data):

The latin1 codec can round trip every byte. You can verify this with the following short program:

>>> s = ''.join(chr(i) for i in range(0x100))
>>> s
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0¡¢£¤¥¦§¨©ª«¬\xad®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'
>>> s2 = s.encode('latin1').decode('latin1')
>>> s2 == s
True
>>> sb = bytes(range(0x100))
>>> sb
b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
>>> sb == s.encode('latin1')
True
like image 183
Anthony Sottile Avatar answered Sep 21 '22 23:09

Anthony Sottile


Just now I ran into the same problem. This is what I came up with:

import struct

def rawbytes(s):
    """Convert a string to raw bytes without encoding"""
    outlist = []
    for cp in s:
        num = ord(cp)
        if num < 255:
            outlist.append(struct.pack('B', num))
        elif num < 65535:
            outlist.append(struct.pack('>H', num))
        else:
            b = (num & 0xFF0000) >> 16
            H = num & 0xFFFF
            outlist.append(struct.pack('>bH', b, H))
    return b''.join(outlist)

Some examples:

In [34]: rawbytes('this is a test')
Out[34]: b'this is a test'

In [35]: rawbytes('\udc80\udcdf\udcff\udcff\udcff\x7f')
Out[35]: b'\xdc\x80\xdc\xdf\xdc\xff\xdc\xff\xdc\xff\x7f'
like image 34
Roland Smith Avatar answered Sep 20 '22 23:09

Roland Smith