
Stream multiple files into a readable object in Python

I have a function which processes binary data from a file using the file.read(len) method. However, my file is huge and has been cut into many smaller files of 50 MB each. Is there some wrapper class that feeds many files into a buffered stream and provides a read() method?

The fileinput.FileInput class can do something like this, but it only supports line-by-line reading (its readline() method takes no arguments) and does not provide a read(len) method for reading a specified number of bytes.
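
Roughly what I am after (the names below are made up for illustration):

# Hypothetical sketch: process() needs one file-like object with read(n),
# but the data is split across many 50 MB part files.
def process(stream):
    header = stream.read(16)  # read a fixed-size binary record
    # ... parse the rest of the stream ...

part_names = ['data.bin.000', 'data.bin.001', 'data.bin.002']
# Wanted: a wrapper that concatenates the parts behind a single read(), e.g.
# process(some_wrapper(open(name, 'rb') for name in part_names))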

Asked Jul 02 '14 by xivaxy



2 Answers

It's quite easy to concatenate iterables with itertools.chain:

from itertools import chain

def read_by_chunks(file_objects, block_size=1024):
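    # One chunk-iterator per file: iter(callable, sentinel) keeps calling
    # f.read(block_size) until it returns the sentinel '' (EOF in text mode).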
    readers = (iter(lambda f=f: f.read(block_size), '') for f in file_objects)
    return chain.from_iterable(readers)

You can then do:

for chunk in read_by_chunks([f1, f2, f3, f4], 4096):
    handle(chunk)

This processes the files in sequence while reading them in chunks of 4096 bytes.

If you need to provide an object with a read method because some other function expects it, you can write a very simple wrapper:

class ConcatFiles(object):
    def __init__(self, files, block_size):
        self._block_size = block_size  # remembered for the variable-size read() shown below
        self._reader = read_by_chunks(files, block_size)

    def __iter__(self):
        return self._reader

    def read(self):
        return next(self._reader, '')
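
A quick usage sketch (the file names and the handle() consumer are placeholders; the files are opened in text mode here to match the '' sentinel above, see the note on binary mode below):

with open('part1.txt') as f1, open('part2.txt') as f2:
    stream = ConcatFiles([f1, f2], block_size=4096)
    while True:
        chunk = stream.read()
        if not chunk:
            break
        handle(chunk)  # handle() stands in for whatever consumes the data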

This, however, only uses a fixed block size. It's possible to support a block_size parameter for read() by doing something like:

def read(self, block_size=None):
    block_size = block_size or self._block_size
    total_read = 0
    chunks = []

    for chunk in self._reader:
        chunks.append(chunk)
        total_read += len(chunk)
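        # Strict '>' means we only split once we have more than block_size bytes,
        # so the remainder pushed back into the iterator is never the empty string.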
        if total_read > block_size:
            contents = ''.join(chunks)
            self._reader = chain([contents[block_size:]], self._reader)
            return contents[:block_size]
    return ''.join(chunks)

Note: if you are reading in binary mode, you should replace the empty strings '' in the code with empty bytes b'' (otherwise the iter(..., '') sentinel is never reached and the iteration will not stop).

Answered by Bakuriu

Instead of converting the list of streams into a generator, as the other answer does, you can chain the streams together and then use the file interface:

import io

def chain_streams(streams, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    Chain an iterable of streams together into a single buffered stream.
    Usage:
        def generate_open_file_streams():
            for file in filenames:
                yield open(file, 'rb')
        f = chain_streams(generate_open_file_streams())
        f.read()
    """

    class ChainStream(io.RawIOBase):
        def __init__(self):
            self.leftover = b''
            self.stream_iter = iter(streams)
            try:
                self.stream = next(self.stream_iter)
            except StopIteration:
                self.stream = None

        def readable(self):
            return True

        def _read_next_chunk(self, max_length):
            # Return 0 or more bytes from the current stream, yielding any
            # leftover bytes first. Returns b'' when the current stream is
            # exhausted or there is no current stream.
            if self.leftover:
                return self.leftover
            elif self.stream is not None:
                return self.stream.read(max_length)
            else:
                return b''

        def readinto(self, b):
            buffer_length = len(b)
            chunk = self._read_next_chunk(buffer_length)
            while len(chunk) == 0:
                # move to next stream
                if self.stream is not None:
                    self.stream.close()
                try:
                    self.stream = next(self.stream_iter)
                    chunk = self._read_next_chunk(buffer_length)
                except StopIteration:
                    # No more streams to chain together
                    self.stream = None
                    return 0  # indicate EOF
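            # Keep any bytes beyond the caller's buffer for the next readinto() call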
            output, self.leftover = chunk[:buffer_length], chunk[buffer_length:]
            b[:len(output)] = output
            return len(output)

    return io.BufferedReader(ChainStream(), buffer_size=buffer_size)

Then use it like any other file/stream:

f = chain_streams(open_files_or_chunks)
f.read(1024)  # any number of bytes; reads cross file boundaries transparently
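
For instance, applied to the scenario in the question (the file names, the glob pattern, and process() below are made up for illustration):

import glob

def open_parts():
    # Lazily open the 50 MB part files in order
    for name in sorted(glob.glob('data.bin.*')):
        yield open(name, 'rb')

f = chain_streams(open_parts())
header = f.read(16)          # a read can span a file boundary transparently
while True:
    block = f.read(65536)
    if not block:
        break
    process(block)           # process() stands in for the real handler
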
Answered by Hardbyte