Clean way to read a null-terminated (C-style) string from a file?

I'm looking for a clean and simple way to read a null-terminated C string from a file or file-like object in Python. It should either consume no more input than it needs, or push the excess back onto whatever file/buffer it works with, so that other code can read the data immediately following the null-terminated string.

I've seen a bit of rather ugly code to do it, but not much that I'd like to use.

universal newlines support only works for open()ed files, not StringIO objects etc., and it doesn't look like it handles unconventional newline characters. Even if it did work, the result would be strings with \n appended, which is undesirable.

struct doesn't support reading arbitrary-length C strings at all; it requires a fixed length as part of the format.

ctypes has c_buffer, which can be constructed from a byte string and will return the first null-terminated string as its value. Again, this requires determining in advance how much must be read, and it doesn't differentiate between null-terminated and unterminated strings. The same is true of c_char_p. So it doesn't seem to help much: you already have to know you've read enough of the string, and you still have to handle strings split across buffer boundaries.
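To illustrate the limitation: ctypes only extracts the first NUL-terminated run from a buffer you've already read in full, and an unterminated buffer is indistinguishable from a terminated one, since create_string_buffer appends its own NUL:

```python
import ctypes

# create_string_buffer copies a byte string you have already read in full;
# .value stops at the first NUL and silently discards everything after it.
buf = ctypes.create_string_buffer(b"first\0second\0")
print(buf.value)  # b'first' -- the trailing data isn't "pushed back" anywhere

# An unterminated input looks exactly like a terminated one, because the
# constructor allocates len+1 bytes and adds a trailing NUL itself.
print(ctypes.create_string_buffer(b"no terminator").value)  # b'no terminator'
```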

The usual way to do this in C is to read chunks into a buffer, copying and resizing the buffer if needed, then check whether the newest chunk contains a null byte. If it does, return everything up to the null byte and either realign the buffer or, if you're being fancy, keep reading and use it as a ring buffer. (This only works if you can hand the excess data back to the caller, or if your platform's ungetc lets you push a lot back onto the file, of course.)
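When the underlying file is seekable, that C idiom translates fairly directly to Python: instead of pushing the excess back, you can seek the file to just past the terminator. A sketch of this approach (my own, not a stdlib helper, and the name readcstr_chunked is made up):

```python
import io

def readcstr_chunked(f, chunk_size=4096):
    """Read a NUL-terminated bytestring from a seekable binary file.

    Reads in chunks; on finding the terminator, seeks the file back so
    the next read starts immediately after the NUL.
    """
    buf = bytearray()
    while True:
        chunk = f.read(chunk_size)
        if not chunk:                        # EOF before any terminator
            return bytes(buf)
        nul = chunk.find(b"\0")
        if nul >= 0:
            # "Push back" the excess by repositioning the file just
            # past the NUL (relative seek from the current position).
            f.seek(nul + 1 - len(chunk), io.SEEK_CUR)
            return bytes(buf + chunk[:nul])
        buf += chunk                         # no NUL yet; keep accumulating

f = io.BytesIO(b"one\0two\0trailing")
print(readcstr_chunked(f))  # b'one'
print(readcstr_chunked(f))  # b'two'
print(f.read())             # b'trailing'
```

The relative seek is what makes consecutive calls compose: each call leaves the file positioned exactly after the terminator it consumed.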

Is it necessary to spell out similar code in Python? I was surprised not to find anything canned in io, ctypes or struct.

file objects don't seem to have a way to push back onto their buffer, like ungetc, and neither do buffered I/O streams in the io module.
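One partial exception: io.BufferedReader exposes peek(), which returns buffered bytes without consuming them, so you can inspect data before deciding how much to read. A sketch of a reader built on it (my own; note peek() may return fewer bytes than the remaining data, so the loop has to handle buffer boundaries):

```python
import io

def readcstr_peek(f):
    """Read a NUL-terminated bytestring from an io.BufferedReader
    using peek() to look ahead without consuming input."""
    buf = bytearray()
    while True:
        pending = f.peek(1)          # buffered bytes, not consumed; b'' at EOF
        if not pending:
            return bytes(buf)        # EOF without a terminator
        nul = pending.find(b"\0")
        if nul >= 0:
            buf += f.read(nul)       # consume everything before the NUL...
            f.read(1)                # ...and the NUL itself, but no more
            return bytes(buf)
        buf += f.read(len(pending))  # consume only what we inspected; loop

f = io.BufferedReader(io.BytesIO(b"abc\0def"))
print(readcstr_peek(f))  # b'abc'
print(f.read())          # b'def'
```

Unlike the seek-based approach, this never over-reads from the buffered stream, so it also works on non-seekable sources such as pipes wrapped in a BufferedReader.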

I feel like I must be missing the obvious here. I'd really rather avoid byte-by-byte reading:

def readcstr(f):
    # f must be opened in binary mode; read(1) returns b'' at EOF, not None
    buf = bytearray()
    while True:
        b = f.read(1)
        if not b or b == b'\0':
            return bytes(buf)
        buf += b

but right now that's what I'm doing.

asked Sep 25 '15 by Craig Ringer


1 Answer

An incredibly mild improvement on what you have (mostly in that it pushes the work into built-ins which, in CPython, are implemented in C and usually run faster):

import functools
import itertools

def readcstr(f):
    toeof = iter(functools.partial(f.read, 1), '')
    return ''.join(itertools.takewhile('\0'.__ne__, toeof))

This is relatively ugly (and sensitive to the type of the file object; it won't work with file objects that return unicode), but it pushes all the work to the C layer. The two-arg iter ensures you stop if the file is exhausted, while itertools.takewhile looks for (and consumes) the NUL terminator but no more; ''.join then combines the bytes read into a single return value.
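For a file opened in binary mode on Python 3, the same approach works once the sentinels are bytes rather than str (my adaptation of the answer's code, not part of the original):

```python
import functools
import io
import itertools

def readcstr(f):
    # b'' is what read(1) returns at EOF on a binary-mode file,
    # so the two-arg iter stops there; takewhile stops at (and
    # consumes) the first b'\0'.
    toeof = iter(functools.partial(f.read, 1), b'')
    return b''.join(itertools.takewhile(b'\0'.__ne__, toeof))

f = io.BytesIO(b"hello\0world")
print(readcstr(f))  # b'hello'
print(f.read())     # b'world'
```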

answered Sep 28 '22 by ShadowRanger