Suppose I have a BIG file with some lines I wish to ignore, and a function (file_function) which takes a file object. Can I return a new file object whose lines satisfy some condition, without reading the entire file first? The laziness is the important part.
Note: I could just save a temporary file with these lines ignored, but this is not ideal.
For example, suppose I had a csv file (with a bad line):
1,2
ooops
3,4
A first attempt was to create a new file object (with the same methods as a file) and override readline:
class FileWithoutCondition:
    def __init__(self, f, condition):
        self.f = f
        self.condition = condition

    def readline(self):
        while True:
            x = self.f.readline()
            if not x:  # EOF: return the empty string instead of looping forever
                return x
            if self.condition(x):
                return x
This works if file_function only uses readline... but not if it requires some other functionality.
with open('file_name', 'r') as f:
    f1 = FileWithoutCondition(f, lambda x: x != 'ooops\n')
    result = file_function(f1)
A solution using StringIO may work, but I can't seem to get it to work.
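(The eager version of that idea is easy enough, but it reads the whole file up front, which is exactly what I want to avoid. For reference, a sketch of that non-lazy workaround:)

import io

with open('file_name', 'r') as f:
    # StringIO is a full file object, so any consumer accepts it,
    # but the join below has already read and filtered everything.
    filtered = io.StringIO(''.join(line for line in f if line != 'ooops\n'))
result = file_function(filtered)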
Ideally we should assume that file_function is a black-box function; specifically, I can't just tweak it to accept a generator (but maybe I can tweak a generator to be file-like? See the sketch at the end of this question). Is there a standard way to do this kind of lazy (skim-)reading of a generic file?
Note: the motivating example for this question is this pandas question, where just having a readline is not enough to get pd.read_csv working...
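For reference, here is the kind of generator-to-file-like adapter I have in mind. FilteredFile is a made-up name, and the sketch assumes the consumer only needs a read method (the pandas docs describe a file-like object as anything with a read method, though a given version may probe other attributes):

class FilteredFile:
    # Sketch: expose the filtered lines of f behind a minimal read() interface.
    def __init__(self, f, condition):
        self.lines = (line for line in f if condition(line))
        self.buffer = ''

    def read(self, size=-1):
        # Pull just enough filtered lines to satisfy the request, no more.
        while size < 0 or len(self.buffer) < size:
            try:
                self.buffer += next(self.lines)
            except StopIteration:
                break
        if size < 0:
            size = len(self.buffer)
        data, self.buffer = self.buffer[:size], self.buffer[size:]
        return data

with open('file_name', 'r') as f:
    result = file_function(FilteredFile(f, lambda x: x != 'ooops\n'))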
Use a map-reduce approach with existing Python facilities. In this example I'm using a regular expression to match lines that start with the string GET /index, but you can use whatever condition fits your bill:
import re
from collections import defaultdict

# Note the capture group around .*: match.group(1) below depends on it.
pattern = re.compile(r'GET /index\((.*)\)\.html')

# define FILE appropriately.

# map
# the 'GET' in line condition cheaply filters lines that cannot match.
matches = (pattern.search(line) for line in open(FILE) if 'GET' in line)
mapp = (match.group(1) for match in matches if match)

# now reduce, lazily:
count = defaultdict(int)
for request in mapp:
    count[request] += 1
This scans a >6GB file in a few seconds on my laptop. You can further split the file into chunks and feed them to threads or processes; a sketch follows below. I do not recommend mmap unless you have enough memory to map the entire file (it doesn't support windowing).
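A minimal sketch of that chunked variant, assuming the same pattern as above (recompiled as bytes, since the chunks are read in binary mode so that byte offsets are reliable); count_chunk and parallel_count are hypothetical names:

import os
import re
from collections import Counter
from multiprocessing import Pool

pattern = re.compile(rb'GET /index\((.*)\)\.html')

def count_chunk(args):
    # Count pattern matches for lines that *begin* inside [start, end).
    path, start, end = args
    counts = Counter()
    with open(path, 'rb') as f:
        if start:
            f.seek(start - 1)
            if f.read(1) != b'\n':
                f.readline()  # landed mid-line: the previous chunk owns it
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            match = pattern.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts

def parallel_count(path, workers=4):
    # Split the file into byte ranges, count each in parallel, merge the results.
    size = os.path.getsize(path)
    step = size // workers + 1
    ranges = [(path, i, min(i + step, size)) for i in range(0, size, step)]
    with Pool(workers) as pool:
        return sum(pool.map(count_chunk, ranges), Counter())

As usual with multiprocessing, call parallel_count from under an if __name__ == '__main__': guard on platforms that spawn worker processes.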