
Efficient way to convert string to ctypes.c_ubyte array in Python

I have a string of 20 bytes, and I would like to convert it to a ctypes.c_ubyte array for bit field manipulation purposes.

 import ctypes
 str_bytes = '01234567890123456789'
 byte_arr = bytearray(str_bytes)
 raw_bytes = (ctypes.c_ubyte*20)(*(byte_arr))

Is there a way to avoid a deep copy from str to bytearray for the sake of the cast?

Alternatively, is it possible to convert a string to a bytearray without a deep copy? (With techniques like memoryview?)

I am using Python 2.7.

Performance results:

Using eryksun's and Brian Larsen's suggestions, here are the benchmarks from a VirtualBox VM running Ubuntu 12.04 and Python 2.7.

  • method1 uses the code from my original post
  • method2 uses ctypes from_buffer_copy
  • method3 uses ctypes cast/POINTER
  • method4 uses numpy

Results (1,000,000 iterations each):

  • method1 takes 3.87 s
  • method2 takes 0.42 s
  • method3 takes 1.44 s
  • method4 takes 8.79 s

Code:

import ctypes
import time
import numpy

str_bytes = '01234567890123456789'

def method1():
    result = ''
    t0 = time.clock()
    for x in xrange(0,1000000):     
        byte_arr = bytearray(str_bytes)
        result = (ctypes.c_ubyte*20)(*(byte_arr))

    t1 = time.clock()
    print(t1-t0)

    return result

def method2():

    result = ''
    t0 = time.clock()
    for x in xrange(0,1000000):     
        result = (ctypes.c_ubyte * 20).from_buffer_copy(str_bytes)

    t1 = time.clock()
    print(t1-t0)

    return result

def method3():

    result = ''
    t0 = time.clock()
    for x in xrange(0,1000000):     
        result = ctypes.cast(str_bytes, ctypes.POINTER(ctypes.c_ubyte * 20))[0]

    t1 = time.clock()
    print(t1-t0)

    return result

def method4():

    result = ''
    t0 = time.clock()
    for x in xrange(0,1000000):     
        arr = numpy.asarray(str_bytes)
        # Note: data_as() returns a pointer to the array, not the array itself.
        result = arr.ctypes.data_as(ctypes.POINTER(ctypes.c_ubyte*len(str_bytes)))

    t1 = time.clock()
    print(t1-t0)

    return result

print(method1())
print(method2())
print(method3())
print(method4())
askldjd asked Jan 31 '14 15:01


2 Answers

I don't think that's working how you think. bytearray creates a copy of the string. Then the interpreter unpacks the bytearray sequence into a starargs tuple and merges this into another new tuple that has the other args (even though there are none in this case). Finally, the c_ubyte array initializer loops over the args tuple to set the elements of the c_ubyte array. That's a lot of work, and a lot of copying, to go through just to initialize the array.
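
The steps described above can be spelled out one at a time. This is only an illustrative sketch: the intermediate name args is hypothetical, and the explicit tuple approximates what the interpreter builds internally for the star-args call. A bytes literal is used so the snippet should behave the same on Python 2.7 and Python 3.

```python
import ctypes

str_bytes = b'01234567890123456789'  # bytes literal; identical to str on Python 2.7
arr_t = ctypes.c_ubyte * 20

# Step 1: bytearray() makes a mutable copy of the string's bytes.
byte_arr = bytearray(str_bytes)

# Step 2: the * unpacking turns every element into a separate int argument,
# which is roughly equivalent to building this tuple first.
args = tuple(byte_arr)

# Step 3: the array initializer loops over the arguments one by one.
raw_bytes = arr_t(*args)

# Same contents as the one-liner arr_t(*bytearray(str_bytes)).
same_contents = list(raw_bytes) == list(arr_t(*bytearray(str_bytes)))
```

Each of the three steps touches all 20 bytes, which is why this route is so much slower than a single block copy.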

Instead you can use the from_buffer_copy method, assuming the string is a bytestring with the buffer interface (not unicode):

import ctypes    
str_bytes = '01234567890123456789'
raw_bytes = (ctypes.c_ubyte * 20).from_buffer_copy(str_bytes)

That still has to copy the string, but it's only done once, and much more efficiently. As was stated in the comments, a Python string is immutable and could be interned or used as a dict key. Its immutability should be respected, even if ctypes lets you violate this in practice:

>>> from ctypes import *
>>> s = '01234567890123456789'
>>> b = cast(s, POINTER(c_ubyte * 20))[0]
>>> b[0] = 97
>>> s
'a1234567890123456789'
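
If a writable view is what's actually needed, one alternative that breaks no contract is to keep the data in a bytearray and wrap it with from_buffer, which shares memory instead of copying. This is a minimal sketch and assumes the data can be created or received as a bytearray in the first place; converting an existing string to a bytearray still costs one copy. The name buf is only illustrative.

```python
import ctypes

# A bytearray is mutable, so sharing its memory is legitimate.
buf = bytearray(b'01234567890123456789')

# from_buffer() wraps the existing buffer instead of copying it.
raw_bytes = (ctypes.c_ubyte * len(buf)).from_buffer(buf)

# Write through the ctypes array ...
raw_bytes[0] = 97  # ord('a')

# ... and the change is visible in the bytearray, because both
# objects refer to the same memory.
first_byte = buf[0]
```

Note that the bytearray cannot be resized while the ctypes array still exports its buffer.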

Edit

I need to emphasize that I am not recommending using ctypes to modify an immutable CPython string. If you have to, then at the very least check sys.getrefcount beforehand to ensure that the reference count is 2 or less (the call adds 1). Otherwise, you will eventually be surprised by string interning for names (e.g. "sys") and code object constants. Python is free to reuse immutable objects as it sees fit. If you step outside of the language to mutate an 'immutable' object, you've broken the contract.

For example, if you modify an already-hashed string, the cached hash is no longer correct for the contents. That breaks it for use as a dict key. Neither another string with the new contents nor one with the original contents will match the key in the dict. The former has a different hash, and the latter has a different value. Then the only way to get at the dict item is by using the mutated string that has the incorrect hash. Continuing from the previous example:

>>> s
'a1234567890123456789'
>>> d = {s: 1}
>>> d[s]
1

>>> d['a1234567890123456789']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'a1234567890123456789'

>>> d['01234567890123456789']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: '01234567890123456789'

Now consider the mess if the key is an interned string that's reused in dozens of places.


For performance analysis it's typical to use the timeit module. Prior to 3.3, timeit.default_timer varies by platform. On POSIX systems it's time.time, and on Windows it's time.clock.

import timeit

setup = r'''
import ctypes, numpy
str_bytes = '01234567890123456789'
arr_t = ctypes.c_ubyte * 20
'''

methods = [
  'arr_t(*bytearray(str_bytes))',
  'arr_t.from_buffer_copy(str_bytes)',
  'ctypes.cast(str_bytes, ctypes.POINTER(arr_t))[0]',
  'numpy.asarray(str_bytes).ctypes.data_as('
      'ctypes.POINTER(arr_t))[0]',
]

test = lambda m: min(timeit.repeat(m, setup))

>>> tabs = [test(m) for m in methods]
>>> trel = [t / tabs[0] for t in tabs]
>>> trel
[1.0, 0.060573711879182784, 0.261847116395079, 1.5389279092185282]
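
For convenience, the same comparison could be run as a standalone script along these lines. This is a sketch, not the exact setup that produced the numbers above: it leaves out the numpy variant, uses a bytes literal so it should run on both Python 2.7 and Python 3, and the helper name best_time and the number and repeat values are arbitrary choices.

```python
import timeit

setup = r'''
import ctypes
str_bytes = b'01234567890123456789'
arr_t = ctypes.c_ubyte * 20
'''

methods = [
    'arr_t(*bytearray(str_bytes))',
    'arr_t.from_buffer_copy(str_bytes)',
    'ctypes.cast(str_bytes, ctypes.POINTER(arr_t))[0]',
]

def best_time(stmt, number=20000, repeat=3):
    # Take the minimum of several runs to reduce noise from other processes.
    return min(timeit.repeat(stmt, setup, repeat=repeat, number=number))

tabs = [best_time(m) for m in methods]   # absolute times in seconds
trel = [t / tabs[0] for t in tabs]       # relative to the first method

for stmt, rel in zip(methods, trel):
    print('%-50s %.3f' % (stmt, rel))
```

The absolute times depend on the machine, so only the relative ordering is worth comparing.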
Eryk Sun answered Oct 05 '22 03:10


Here is another solution for you to benchmark (I would be very interested in the results).

Using numpy might add some simplicity depending on what the whole code looks like.

import numpy as np
import ctypes
str_bytes = '01234567890123456789'
arr = np.asarray(str_bytes)
aa = arr.ctypes.data_as(ctypes.POINTER(ctypes.c_ubyte*len(str_bytes)))
for v in aa.contents: print v
48
49
50
51
52
53
54
55
56
57
48
49
50
51
52
53
54
55
56
57
Brian Larsen answered Oct 05 '22 03:10