I need to use a with open(...) statement to open the files, because I need to open a few hundred files at once and merge them using a K-way merge.
So I need fast I/O that does not hold a whole file, or a large portion of one, in memory (there are hundreds of files, each about 10 MB). I only need to read one line at a time for the K-way merge. Reducing memory usage is my primary focus right now.
I learned that with open is the most efficient technique, but I cannot work out how to open all the files together in a single with open statement.
It's fairly easy to write your own context manager to handle this by using the contextlib.contextmanager function decorator to define "a factory function for with statement context managers", as the documentation puts it. For example:
from contextlib import contextmanager

@contextmanager
def multi_file_manager(files, mode='rt'):
    """ Open multiple files and make sure they all get closed. """
    opened = []
    try:
        for file in files:
            opened.append(open(file, mode))
        yield opened
    finally:
        # Runs even if an open() fails or the with-block raises.
        for file in opened:
            file.close()

if __name__ == '__main__':
    filenames = 'file1', 'file2', 'file3'
    with multi_file_manager(filenames) as files:
        a = files[0].readline()
        b = files[2].readline()
        ...
If you don't know all the files ahead of time, it would be equally easy to create a context manager that supports adding them incrementally within the context. In the code below, contextlib.ContextDecorator is used as the base class to simplify the implementation of a MultiFileManager class.
from contextlib import ContextDecorator

class MultiFileManager(ContextDecorator):
    def __init__(self, files=None):
        self.files = [] if files is None else files

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        for file in self.files:
            file.close()

    def __iadd__(self, other):
        """Add file to be closed when leaving context."""
        self.files.append(other)
        return self

if __name__ == '__main__':
    filenames = 'mfm_file1.txt', 'mfm_file2.txt', 'mfm_file3.txt'
    with MultiFileManager() as mfmgr:
        for count, filename in enumerate(filenames, start=1):
            file = open(filename, 'w')
            mfmgr += file  # Add file to be closed later.
            file.write(f'this is file {count}\n')
While not a solution for 2.7, I should note there is one good, correct solution for 3.3+, contextlib.ExitStack, which can be used to do this correctly (surprisingly difficult to get right when you roll your own) and nicely:
from contextlib import ExitStack

with open('source_dataset.txt') as src_file, ExitStack() as stack:
    files = [stack.enter_context(open(fname, 'w')) for fname in fname_list]
    # do stuff with src_file and the values in files ...
# src_file and all elements in stack cleaned up on block exit
Importantly, if any of the opens fails, all of the opens that succeeded prior to that point will be cleaned up deterministically; most naive solutions end up failing to clean up in that case, relying on the garbage collector at best, and in cases like lock acquisition where there is no object to collect, failing to ever release the lock.
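To make the cleanup guarantee concrete, here is a small sketch that forces the third open to fail and then checks the two files that did open. The file names (a.txt, b.txt, missing.txt) and the opened list are made up for the demonstration; only ExitStack and enter_context come from the standard library as described above.

```python
import os
import tempfile
from contextlib import ExitStack

opened = []  # keep references so the files can be inspected afterwards

with tempfile.TemporaryDirectory() as tmpdir:
    # Two files that exist, followed by one that does not (hypothetical names).
    good = [os.path.join(tmpdir, name) for name in ('a.txt', 'b.txt')]
    for path in good:
        with open(path, 'w') as f:
            f.write('data\n')
    paths = good + [os.path.join(tmpdir, 'missing.txt')]

    try:
        with ExitStack() as stack:
            for path in paths:
                opened.append(stack.enter_context(open(path)))
    except FileNotFoundError:
        pass  # the third open() raised inside the with-block

# The stack closed both files that had been opened before the failure.
print([f.closed for f in opened])  # [True, True]
```

The exception from the failed open propagates out of the with block, and ExitStack closes everything it had already been handed on the way out.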
Posted here since this question was marked as the "original" for a duplicate that didn't specify Python version.
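For the K-way merge in the original question, ExitStack can be combined with heapq.merge from the standard library. The following is only a sketch under some assumptions: each input file is already sorted by plain string comparison of its lines, every line ends with a newline, and merge_sorted_files is a hypothetical helper name, not something from the posts above.

```python
import heapq
from contextlib import ExitStack

def merge_sorted_files(input_paths, output_path):
    """Merge already-sorted text files line by line into output_path.

    Assumes every input is sorted by plain string comparison of its lines.
    """
    with ExitStack() as stack:
        inputs = [stack.enter_context(open(path, 'rt')) for path in input_paths]
        out = stack.enter_context(open(output_path, 'wt'))
        # heapq.merge pulls lazily from each file iterator, so roughly one
        # line per input file is held at a time (plus each file's read buffer).
        out.writelines(heapq.merge(*inputs))
```

Usage would look like merge_sorted_files(list_of_paths, 'merged.txt'). With a few hundred inputs it may be worth checking the operating system's limit on simultaneously open files, since every input stays open until the merge finishes.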