
Non-blocking stream that supports seek from an HTTP response in Python

I have an HTTP response from urllib2:

response = urllib2.urlopen('http://python.org/')

Ultimately, I want to be able to seek() within the response (at least back to the beginning), so that code like this works:

print result.readline()
result.seek(0)
print result.readline()

The simplest solution is to wrap the whole body in StringIO or io.BytesIO, like this:

result = io.BytesIO(response.read())

However, the resources I want to request tend to be very large, and I want to start working with them (parsing, for example) before the whole download has finished. response.read() blocks until the entire body has been received, so I'm looking for a non-blocking solution.

The ideal code would read(BUFFER_SIZE) from the resource and, whenever more content is needed, request more from the response. I'm basically looking for a wrapper class that can do that. It also needs to be a file-like object.

I thought I could write something like:

base = io.BufferedIOBase(response)
result = io.BufferedReader(base)

However, this does not work. I have tried different classes from the io module but couldn't get any of them working. I'm happy with any wrapper class that has the desired behaviour.
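For illustration, the wrapper described above (read BUFFER_SIZE bytes at a time and fetch more only when needed) could be sketched roughly as follows. This is a minimal Python 3 sketch, not a tested production class: the name LazySeekableReader, the BUFFER_SIZE value and the io.BytesIO stand-in for the HTTP response are all assumptions made for the example. It caches everything read so far, so seek() works anywhere in the already-downloaded part, at the cost of memory growing with the amount read.

```python
import io

BUFFER_SIZE = 8192  # hypothetical chunk size, tune as needed


class LazySeekableReader(object):
    """Hypothetical wrapper: caches what has been read so seek() works."""

    def __init__(self, fp, chunk_size=BUFFER_SIZE):
        self.fp = fp                 # any object whose read(n) returns bytes
        self.cache = io.BytesIO()    # everything pulled from fp so far
        self.cached = 0              # number of bytes in the cache
        self.pos = 0                 # current logical read position
        self.chunk_size = chunk_size
        self.eof = False

    def _fill(self, upto=None):
        """Pull chunks from fp until `upto` bytes are cached (or EOF)."""
        while not self.eof and (upto is None or self.cached < upto):
            chunk = self.fp.read(self.chunk_size)
            if not chunk:
                self.eof = True
                break
            self.cache.seek(self.cached)   # always append at the end
            self.cache.write(chunk)
            self.cached += len(chunk)

    def read(self, size=-1):
        self._fill(None if size < 0 else self.pos + size)
        self.cache.seek(self.pos)
        data = self.cache.read(size)
        self.pos += len(data)
        return data

    def readline(self):
        while True:
            self.cache.seek(self.pos)
            line = self.cache.readline()
            if line.endswith(b"\n") or self.eof:
                self.pos += len(line)
                return line
            self._fill(self.cached + 1)    # line incomplete: fetch one more chunk

    def seek(self, offset):
        if offset < 0:
            raise ValueError("negative seek position")
        self._fill(offset)                 # download only as far as requested
        self.pos = min(offset, self.cached)

    def tell(self):
        return self.pos


# io.BytesIO stands in for the HTTP response object here.
source = io.BytesIO(b"first line\nsecond line\nthird line\n")
result = LazySeekableReader(source, chunk_size=4)
first = result.readline()
result.seek(0)
again = result.readline()
```

Only as many chunks as the current read needs are requested from the underlying object, so parsing can start before the download is complete. Each individual read(n) on the response can still block while that chunk arrives.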

dominik asked Jan 04 '13 13:01


1 Answer

I wrote my own wrapper class which preserves the first chunk of data. This way I can seek back to the beginning, analyze the encoding, file type and other things. This class solves the problem for me and should be simple enough to adapt to other use cases.

import cStringIO


class BufferedFile(object):
    '''A buffered file that preserves the beginning of a stream, up to
    roughly buffer_size bytes, so that it can be re-read after seek().
    '''
    def __init__(self, fp, buffer_size=1024):
        self.data = cStringIO.StringIO()
        self.fp = fp
        self.offset = 0
        self.len = 0
        self.fp_offset = 0
        self.buffer_size = buffer_size

    @property
    def _buffer_full(self):
        return self.len >= self.buffer_size

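    def readline(self):
        # Data between the end of the buffer and the current position of
        # the underlying stream has been consumed and cannot be re-read.
        if self.len <= self.offset < self.fp_offset:
            raise BufferError('Line is not available anymore')
        if self.offset >= self.len:
            # Past the buffered part: read from the underlying stream.
            line = self.fp.readline()
            self.fp_offset += len(line)
            self.offset += len(line)
            if not self._buffer_full:
                # Append at the end of the buffer; an earlier seek() may
                # have left the buffer position somewhere else.
                self.data.seek(self.len)
                self.data.write(line)
                self.len += len(line)
        else:
            line = self.data.readline()
            self.offset += len(line)
        return line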

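    def seek(self, offset):
        # Offsets past the buffer that the underlying stream has already
        # consumed cannot be reached again.
        if self.len <= offset < self.fp_offset:
            raise BufferError('Cannot seek because data is not buffered here')
        self.offset = offset
        if offset < self.len:
            self.data.seek(offset)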
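
Since cStringIO only exists in Python 2, here is a rough Python 3 sketch of the same idea (keep only the head of the stream) together with a usage example. The class name HeadBufferedReader, the sample XML bytes and the io.BytesIO stand-in for the response are assumptions for illustration, not part of the original class.

```python
import io


class HeadBufferedReader(object):
    """Hypothetical Python 3 variant: preserves only the head of a stream."""

    def __init__(self, fp, buffer_size=1024):
        self.fp = fp
        self.head = io.BytesIO()   # preserved beginning of the stream
        self.head_len = 0          # bytes stored in head
        self.fp_offset = 0         # bytes consumed from fp so far
        self.offset = 0            # current logical position
        self.buffer_size = buffer_size

    def readline(self):
        if self.offset < self.head_len:
            # Still inside the preserved head: serve from the buffer.
            self.head.seek(self.offset)
            line = self.head.readline()
        elif self.offset == self.fp_offset:
            # At the live position: read from the underlying stream.
            line = self.fp.readline()
            self.fp_offset += len(line)
            if self.head_len < self.buffer_size:
                self.head.seek(self.head_len)
                self.head.write(line)
                self.head_len += len(line)
        else:
            raise BufferError("data at this offset was not preserved")
        self.offset += len(line)
        return line

    def seek(self, offset):
        # Reachable: the preserved head, or the live stream position.
        if not (0 <= offset < self.head_len or offset == self.fp_offset):
            raise BufferError("cannot seek outside the preserved head")
        self.offset = offset


# io.BytesIO stands in for the HTTP response object here.
stream = io.BytesIO(b"<?xml version='1.0'?>\n<root/>\n")
result = HeadBufferedReader(stream, buffer_size=64)
looks_like_xml = result.readline().startswith(b"<?xml")  # sniff the file type
result.seek(0)                                           # rewind before parsing
lines = [result.readline(), result.readline()]
```

Memory use stays bounded by roughly buffer_size plus one line, which is the trade-off against a wrapper that caches the whole download: only the beginning of the stream can be revisited.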
dominik answered Oct 13 '22 18:10
