
Is it possible to restore corrupted “interned” bytes-objects?

It is well known that small bytes-objects are automatically "interned" by CPython (similar to the intern-function for strings). Correction: as explained by @abarnert, this works more like the integer pool than like interned strings.

Is it possible to restore the interned bytes-objects after they have been corrupted by, let's say, an "experimental" third-party library, or is restarting the kernel the only way?

The proof of concept can be done with Cython-functionality (Cython>=0.28):

%%cython
def do_bad_things():
   cdef bytes b=b'a'
   cdef const unsigned char[:] safe=b  
   cdef char *unsafe=<char *> &safe[0]   #who needs const and type-safety anyway?
   unsafe[0]=98                          #overwrite with ord('b')

or as suggested by @jfs through ctypes:

import ctypes
import sys
def do_bad_things():
    b = b'a'
    (ctypes.c_ubyte * sys.getsizeof(b)).from_address(id(b))[-2] = 98
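
Why index `[-2]`? The following is a read-only sketch, assuming CPython (where `id()` is the object's address) and the usual bytes layout of header, payload, and one trailing NUL byte. The helper name `peek_payload_byte` is made up for illustration, and nothing is written to memory here.

```python
import ctypes
import sys


def peek_payload_byte(obj):
    """Read (never write) the last payload byte and the terminator of a bytes object."""
    if sys.implementation.name != "cpython":
        raise RuntimeError("relies on id() being a memory address (CPython only)")
    # View the whole object, header included, as an array of unsigned bytes.
    raw = (ctypes.c_ubyte * sys.getsizeof(obj)).from_address(id(obj))
    # Layout assumption: index -1 is the trailing NUL, index -2 the last payload byte.
    return raw[-2], raw[-1]


# Built at runtime from integers, so no bytes literal is involved.
last_byte, terminator = peek_payload_byte(bytes([0x78, 0x79, 0x7A]))
```

For a one-byte object the last payload byte is the only payload byte, which is why writing to `[-2]` changes the value.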

Obviously, by misusing C-functionality, do_bad_things changes the immutable (or so CPython thinks) object b'a' to b'b', and because this bytes-object is interned, we can see bad things happen afterwards:

>>> do_bad_things() #b'a' means now b'b'
>>> b'a'==b'b'  #wait for a surprise  
True
>>> print(b'a') #another one
b'b'

Is it possible to restore/clear the bytes-object-pool, so that b'a' means b'a' once again?


A little side note: It seems as if not every bytes-creation process is using this pool. For example:

>>> do_bad_things()
>>> print(b'a')
b'b'
>>> print((97).to_bytes(1, byteorder='little')) #ord('a')=97
b'a'
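
To check which creation routes share the pooled object on a given interpreter, something like the following could be used. It is only a sketch: the helper name `pool_usage_survey` and the chosen routes are my own, and the result is an implementation detail that may differ between CPython versions, so no expected output is given.

```python
def pool_usage_survey(value=0x61):
    """Report, per creation route, whether the result is the same object as bytes([value])."""
    reference = bytes([value])
    padded = bytes([0, value, 0])
    routes = {
        "bytes([value])": bytes([value]),
        "slice": padded[1:2],
        "int.to_bytes": value.to_bytes(1, byteorder="little"),
        "bytes.fromhex": bytes.fromhex(f"{value:02x}"),
    }
    # Identity (is), not equality, tells us whether a shared object was handed out.
    return {name: obj is reference for name, obj in routes.items()}


survey = pool_usage_survey()
for route, shared in survey.items():
    print(f"{route:16} shared={shared}")
```

A route that reports `shared=False` produces a fresh object and is therefore unaffected by a corrupted pool entry.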
asked Jun 05 '18 20:06 by ead


2 Answers

Python 3 doesn't intern bytes objects the way it does str. Instead, it keeps a static array of them the way it does with int.

This is very different under the covers. On the down side, it means there's no table (with an API) to be manipulated. On the up side, it means that if you can find the static array, you can fix it, the same way you would for ints, because the array index and the character value of the string are supposed to be identical.

If you look in bytesobject.c, the array is declared at the top:

static PyBytesObject *characters[UCHAR_MAX + 1];

… and then, for example, within PyBytes_FromStringAndSize:

if (size == 1 && str != NULL &&
    (op = characters[*str & UCHAR_MAX]) != NULL)
{
#ifdef COUNT_ALLOCS
    one_strings++;
#endif
    Py_INCREF(op);
    return (PyObject *)op;
}

Notice that the array is static, so it's not accessible from outside this file, and that it's still refcounting the objects, so callers (even internal stuff in the interpreter, much less your C API extension) can't tell that there's anything special going on.
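
A toy Python model of that fast path may make the behaviour easier to see. This is a sketch, not CPython's code: `ToyBytes` and `from_string_and_size` are made-up names, the payload is deliberately mutable so corruption can be simulated safely, and the step that fills an empty slot on first creation is my assumption (the snippet above only shows the lookup).

```python
UCHAR_MAX = 255


class ToyBytes:
    """Stand-in for PyBytesObject with a deliberately mutable payload."""

    def __init__(self, data):
        self.data = bytearray(data)


# One slot per possible byte value, filled lazily (mirrors `characters`).
characters = [None] * (UCHAR_MAX + 1)


def from_string_and_size(data, size):
    """Toy version of the size == 1 fast path: the slot index is the byte value."""
    if size == 1 and characters[data[0] & UCHAR_MAX] is not None:
        return characters[data[0] & UCHAR_MAX]
    obj = ToyBytes(data[:size])
    if size == 1:
        characters[data[0] & UCHAR_MAX] = obj  # assumed population step
    return obj


first = from_string_and_size(bytes([0x61]), 1)
second = from_string_and_size(bytes([0x61]), 1)   # same object as `first`
first.data[0] = 0x62                              # simulate the corruption
third = from_string_and_size(bytes([0x61]), 1)    # still the slot for 0x61
```

The lookup goes by the requested byte value, not by the object's current contents, so asking for 0x61 keeps returning the corrupted object.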

So, there's no "correct" way to clean this up.

But if you want to get hacky…

If you have a reference to any of the single-char bytes, and you know which character it was supposed to be, you can get to the start of the array and then clean up the whole thing.

Unless you've screwed up even more than you think, you can just construct a one-char bytes and subtract the character it was supposed to be. PyBytes_FromStringAndSize("a", 1) is going to return the object that's supposed to be 'a', even if it happens to actually hold 'b'. How do we know that? Because that's exactly the problem that you're trying to fix.

Actually, there are probably ways you could break things even worse… which all seem very unlikely, but to be safe, let's use a character you're less likely to have broken than a, like \x80:

PyBytesObject *byte80 = (PyBytesObject *)PyBytes_FromStringAndSize("\x80", 1);
PyBytesObject *characters = byte80 - 0x80;

The only other caveat is that if you try to do this from Python with ctypes instead of from C code, it would require some extra care [1], but since you're not using ctypes, let's not worry about that.

So, now we have a pointer to characters, we can walk it. We can't just delete the objects to "unintern" them, because that will hose anyone who has a reference to any of them, and probably lead to a segfault. But we don't have to. Any object that's in the table, we know what it's supposed to be—characters[i] is supposed to be a one-char bytes whose one character is i. So just set it back to that, with a loop something like this:

for (size_t i = 0; i <= UCHAR_MAX; i++) {   // the array has UCHAR_MAX + 1 slots
    if (characters[i]) {
        // do the same hacky stuff you did to break the string in the first place
    }
}

That's all there is to it.
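
The repair loop can be tried out safely on a toy pool in Python. This is only an illustration under simplifying assumptions: the pool is a plain list of one-byte `bytearray` objects (or `None` for slots that were never created), and `restore` is a made-up name.

```python
UCHAR_MAX = 255

# Toy pool: slot i holds a one-byte mutable buffer, or None if never created.
pool = [None] * (UCHAR_MAX + 1)
for value in (0x61, 0x62, 0x80, 0xFF):
    pool[value] = bytearray([value])

holder = pool[0x61]      # somebody else still holds a reference to this object
pool[0x61][0] = 0x62     # simulate corruption: the slot for 0x61 now contains 0x62
pool[0xFF][0] = 0x00     # and the last slot is broken too


def restore(pool):
    """Reset every existing slot in place so that slot i contains byte i again."""
    fixed = 0
    for i in range(UCHAR_MAX + 1):           # covers 0..255 inclusive
        slot = pool[i]
        if slot is not None and slot[0] != i:
            slot[0] = i                      # mutate in place, never replace the object
            fixed += 1
    return fixed


repaired = restore(pool)
```

Because the objects are fixed in place rather than replaced, existing references such as `holder` see the repaired value as well.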


Well, except for compilation. [2]

Fortunately, at the interactive interpreter, each complete top-level statement is its own compilation unit, so… you should be OK with any new line you type after running the fix.

But a module you've imported, that had to be compiled, while you had the broken strings? You've probably screwed up its constants. And I can't think of a good way to clean this up except to forcibly recompile and reimport every module.
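
A rough sketch of the reimport step, using the standard `importlib.reload`. The helper name `reload_modules` is hypothetical, and whether a reload really recompiles from source or just reloads cached bytecode depends on the import system's cache, so treat this as a starting point rather than a guaranteed fix.

```python
import colorsys
import importlib
import sys
import types


def reload_modules(names):
    """Reload the named, already-imported modules; return the names actually reloaded."""
    reloaded = []
    for name in names:
        module = sys.modules.get(name)
        if not isinstance(module, types.ModuleType):
            continue                      # never imported, nothing to do
        spec = getattr(module, "__spec__", None)
        if spec is None or not spec.has_location:
            continue                      # built-in or frozen: no file to load again
        try:
            importlib.reload(module)      # re-executes the module in place
        except Exception:
            continue                      # some modules do not survive a reload
        reloaded.append(name)
    return reloaded


# Example with a small pure-Python stdlib module and a name that is not imported.
done = reload_modules(["colorsys", "surely_not_an_imported_module"])
```

In practice you would run this only after the pool itself has been repaired, and pass the names of your own modules.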


1. The compiler might turn your b'\x80' argument into the wrong thing before it even gets to the C call. And you'd be surprised at all the places you think you're passing around a c_char_p and it's actually getting magically converted to and from bytes. Probably better to use a POINTER(c_uint8).

2. If you compiled some code with b'a' in it, the consts array should have a reference to b'a', which will get fixed. But, since bytes are known immutable to the compiler, if it knows that b'a' == b'b', it may actually store the pointer to the b'b' singleton instead, for the same reason that 123456 is 123456 is true, in which case fixing b'a' may not actually solve the problem.
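
To see where compiled code keeps such constants, you can inspect `co_consts` on a code object. A small sketch (the function name `tag` and the file name `<demo>` are arbitrary):

```python
source = "def tag():\n    return b'a'\n"

namespace = {}
exec(compile(source, "<demo>", "exec"), namespace)
tag = namespace["tag"]

# The bytes literal was turned into an object once, at compile time,
# and is stored in the function's constants tuple.
consts = tag.__code__.co_consts
first = tag()
second = tag()
same_object = first is second                      # both calls return the stored constant
from_consts = any(c is first for c in consts)      # and it is the one in co_consts
```

Every call hands back the object stored at compile time, which is why code compiled while the pool was broken can keep referring to the wrong object.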

answered Nov 13 '22 17:11 by abarnert


I followed the great explanation of @abarnert and here is my implementation of his idea in Cython.

Things to consider:

  1. There is a bytes-pool (as is the case for integers) and not a dynamic structure (as is the case for string-interning). So we can just brute-force all bytes-objects in this pool and ensure that they have the right value.
  2. Only bytes-objects constructed via PyBytes_FromStringAndSize and PyBytes_FromString are using the internal pool, so make sure to use them.

This leads to the following implementation:

%%cython
from libc.limits cimport UCHAR_MAX
from cpython.bytes cimport PyBytes_FromStringAndSize

cdef replace_first_byte(bytes obj, unsigned char new_value):
   cdef const unsigned char[:] safe=obj  
   cdef unsigned char *unsafe=<unsigned char *> &safe[0]   
   unsafe[0]=new_value


def restore_bytes_pool():
    cdef char[1] ch
    #create all possible bytes-objects b'\x00' to b'\xff':
    for i in range(UCHAR_MAX+1):               
        ch[0]=<unsigned char>(i)
        obj=PyBytes_FromStringAndSize(ch, 1) #use it so the pool is used
        replace_first_byte(obj,i)

Slight differences (and, in my opinion, advantages over the original proposal):

  1. this version doesn't need to know how the bytes-object-pool is built, or that it is a contiguous array.
  2. no potentially corrupted bytes-objects are used.

And now:

>>> do_bad_things()
>>> print(b'a')
b'b'

>>> restore_bytes_pool()
>>> print(b'a')
b'a'
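
The same idea can also be sketched without Cython, through ctypes. This assumes CPython (`id()` is the address, and the payload byte sits right before the trailing NUL); the name `restore_bytes_pool_ctypes` is made up. The input buffer is a ctypes array rather than a bytes literal, because a literal might itself be one of the corrupted objects.

```python
import ctypes
import sys

# PyBytes_FromStringAndSize is part of the C API; declare its signature by hand.
_from_string_and_size = ctypes.pythonapi.PyBytes_FromStringAndSize
_from_string_and_size.restype = ctypes.py_object
_from_string_and_size.argtypes = [ctypes.POINTER(ctypes.c_ubyte), ctypes.c_ssize_t]


def restore_bytes_pool_ctypes():
    """Rewrite the payload of every one-byte object the C API hands out."""
    for value in range(256):
        raw = (ctypes.c_ubyte * 1)(value)        # plain C buffer holding the byte
        obj = _from_string_and_size(raw, 1)      # the pooled object, if this build pools it
        view = (ctypes.c_ubyte * sys.getsizeof(obj)).from_address(id(obj))
        view[-2] = value                         # the payload byte precedes the NUL


restore_bytes_pool_ctypes()
```

On an uncorrupted interpreter this just writes each byte back over itself, so it should be harmless to run; on interpreters other than CPython it should not be used at all.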

For testing purposes, here is a function that corrupts (almost) all objects in the pool:

def corrupt_bytes_pool():
    cdef char[1] ch
    for i in range(UCHAR_MAX+1):
        ch[0]=<unsigned char>(i)
        obj=PyBytes_FromStringAndSize(ch, 1)
        replace_first_byte(obj,98)           #sets all to b'b'
answered Nov 13 '22 17:11 by ead