I want to treat many files as if they were all one file. What's the proper, Pythonic way to go from [filenames] => [file objects] => [lines] with generators, without reading an entire file into memory?
We all know the proper way to open a file:
with open("auth.log", "rb") as f:
print sum(f.readlines())
And we know the correct way to link several iterators/generators into one long one:
>>> import itertools
>>> list(itertools.chain(range(3), range(3)))
[0, 1, 2, 0, 1, 2]
but how do I link multiple files together and preserve the context managers?
with open("auth.log", "rb") as f0:
with open("auth.log.1", "rb") as f1:
for line in itertools.chain(f0, f1):
do_stuff_with(line)
# f1 is now closed
# f0 is now closed
# gross
I could ignore the context managers and do something like this, but it doesn't feel right:
files = itertools.chain(*(open(f, "rb") for f in file_names))
for line in files:
    do_stuff_with(line)
Or is this the kind of thing Async IO (PEP 3156) is for, and I'll just have to wait for the elegant syntax later?
There's always fileinput.
import fileinput

for line in fileinput.input(filenames):
    ...
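For instance, here is a minimal sketch using the log file names from the question as stand-ins; fileinput also keeps track of which file the current line came from:

import fileinput

# Hypothetical file names borrowed from the question.
filenames = ["auth.log", "auth.log.1"]

total = 0
for line in fileinput.input(filenames):
    if fileinput.isfirstline():
        # fileinput knows which file is currently being read
        print("reading", fileinput.filename())
    total += 1

print(total, "lines across", len(filenames), "files")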
Reading the source, however, it appears that fileinput.FileInput can't be used as a context manager¹. To fix that, you could use contextlib.closing, since FileInput instances have a sanely implemented close method:
import fileinput
from contextlib import closing

with closing(fileinput.input(filenames)) as line_iter:
    for line in line_iter:
        ...
An alternative that keeps the context manager is to write a simple function that loops over the files, yielding lines as it goes:
def fileinput(files):
    for f in files:
        with open(f, 'r') as fin:
            for line in fin:
                yield line
No real need for itertools.chain here, IMHO ... The magic here is in the yield statement, which is used to transform an ordinary function into a fantastically lazy generator.
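As a quick, hypothetical usage sketch (reusing the file names from the question): each file is opened, drained, and closed by its with block before the next one is touched, so only one file is open at a time:

# Hypothetical usage of the generator defined above.
filenames = ["auth.log", "auth.log.1"]

line_count = sum(1 for line in fileinput(filenames))
print(line_count, "lines total")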
¹ As an aside, starting with Python 3.2, fileinput.FileInput is implemented as a context manager, which does exactly what we did above with contextlib. Now our example becomes:
# Python 3.2+ version
import fileinput

with fileinput.input(filenames) as line_iter:
    for line in line_iter:
        ...
although the other example will work on Python 3.2+ as well.