I noticed that passing Python objects to native code with ctypes
can break mutability expectations.
For example, if I have a C function like:
int print_and_mutate(char *str)
{
str[0] = 'X';
return printf("%s\n", str);
}
and I call it like this:
from ctypes import *
lib = cdll.LoadLibrary("foo.so")
s = b"asdf"
lib.print_and_mutate(s)
The value of s changed, and is now b"Xsdf".
The Python docs say: "You should be careful, however, not to pass them to functions expecting pointers to mutable memory."
Is this only because it breaks expectations of which types are immutable, or can something else break as a result? In other words, if I go in with the clear understanding that my original bytes object will change, even though bytes are normally immutable, is that OK, or will I get some kind of nasty surprise later if I don't use create_string_buffer like I'm supposed to?
Python makes assumptions about immutable objects, so mutating them will definitely break things. Here's a concrete example:
>>> import ctypes as c
>>> x = b'abc' # immutable string
>>> d = {x:123} # Used as key in dictionary (keys must be hashable/immutable)
>>> d
{b'abc': 123}
Now build a mutable ctypes buffer over the immutable object. In CPython, id(x) is the memory address of the Python object, and sys.getsizeof() returns the size of that object in bytes. PyBytes objects have some header overhead, but the bytes of the string sit at the end of the object.
>>> import sys
>>> sys.getsizeof(x)
36
>>> px=(c.c_char*36).from_address(id(x))
>>> px.raw
b'\x02\x00\x00\x00\x00\x00\x00\x000\x8fq\x0b\xfc\x7f\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\xf0\x06\xe61\xeb\x00\x1b\xa9abc\x00'
>>> px.raw[-4:] # last bytes of the object
b'abc\x00'
>>> px[-4]
b'a'
>>> px[-4] = b'y' # Mutate the ctypes buffer, mutating the "immutable" string
>>> x # Now it has a modified value.
b'ybc'
Now try to access the key in the dictionary. Keys are located in O(1) time using their hash, but that hash was computed from the original, "immutable" value, so it is now stale. The key can no longer be found by either the old or the new value:
>>> d # Note that dictionary key changed, too.
{b'ybc': 123}
>>> d[b'ybc'] # Try to access the key
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: b'ybc'
>>> d[b'abc'] # Maybe original key will work? It hashes same as the original...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: b'abc'
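To make that failure mode concrete, here is a small sketch in plain Python (no ctypes mutation needed) of why both lookups miss:

```python
# A dict locates a key by its hash, then confirms with an equality check.
# After the mutation, the stored key still sits in the slot chosen by
# its *original* hash, while its value no longer matches that hash.
old_hash = hash(b'abc')  # hash used when the key was inserted
new_hash = hash(b'ybc')  # hash of the value the key now holds

# Looking up b'ybc' probes the slot for new_hash: nothing is stored
# there, so the lookup raises KeyError.
assert old_hash != new_hash

# Looking up b'abc' probes the correct slot, but the equality check
# (b'abc' == stored key, which now reads as b'ybc') fails, so that
# lookup raises KeyError too.
assert b'abc' != b'ybc'
```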
Various objects are interned by CPython and reused. Examples are small integers (-5 to 256), but also short strings and some literals. This behaviour is entirely implementation defined and may freely change between releases. Mutating such shared objects can trigger anything from no visible effect at all to completely undefined behaviour.
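A quick way to see this sharing (a CPython implementation detail, not a language guarantee) is to compare object identity with `is`:

```python
# In CPython, small integers and single-byte bytes objects are cached
# and shared; `is` compares object identity (the same id()/address).
a = 100
b = 100
print(a is b)   # True in CPython: both names refer to one cached int

x = b'a'
y = b'a'
print(x is y)   # True in CPython: one cached single-byte bytes object

# Mutating such a shared object through ctypes would therefore corrupt
# every reference to it across the interpreter, including literals
# compiled into other modules.
```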
That "original bytes object" is not yours, it is CPython's.
This is about as close to undefined behaviour as you can get in CPython.
Even if nothing breaks at the moment, a future CPython could hand you a pointer into read-only memory, and the program would segfault.
Further, CPython could be sharing the string or subslices with other objects, and you would be modifying all of them.
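For completeness, the supported approach is to hand the C function memory that is explicitly mutable. A minimal sketch follows; the foo.so library and print_and_mutate are from the question and are left commented out, while the pure-Python part simulates the C side's write:

```python
import ctypes

# create_string_buffer copies the bytes into a mutable ctypes array
# (NUL-terminated, len(s) + 1 bytes), so native code can write to it
# without touching any Python bytes object.
s = b"asdf"
buf = ctypes.create_string_buffer(s)

# lib = ctypes.cdll.LoadLibrary("foo.so")   # as in the question
# lib.print_and_mutate(buf)                 # C writes into buf, not s

buf[0] = b"X"        # simulate the C function's str[0] = 'X'
print(buf.value)     # b'Xsdf' -- the mutable buffer changed
print(s)             # b'asdf' -- the original bytes object did not
```

Passing `buf` instead of `s` keeps the mutation confined to memory you own, so hashes, interning, and sharing of the original bytes object are all unaffected.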