
Lazily filtering a file before reading

Tags: python, file

Suppose I have a BIG file with some lines I wish to ignore, and a function (file_function) which takes a file object. Can I return a new file object whose lines satisfy some condition, without reading the entire file first? The laziness is the important part.

Note: I could just save a temporary file with these lines ignored, but this is not ideal.

For example, suppose I had a csv file (with a bad line):

1,2
ooops
3,4

A first attempt was to create a new file-like object (with the same methods as file) and override readline:

class FileWithoutCondition(file):
    """Wraps a file object and skips lines that fail the condition."""
    def __init__(self, f, condition):
        self.f = f
        self.condition = condition
    def readline(self):
        while True:
            x = self.f.readline()
            if not x:  # EOF: return '' rather than loop forever
                return x
            if self.condition(x):
                return x

This works if file_function only uses readline... but not if it requires some other functionality.

with open('file_name', 'r') as f:
    f1 = FileWithoutCondition(f, lambda x: x != 'ooops\n')
    result = file_function(f1)

A solution using StringIO may work, but I can't seem to get it to work.

Ideally we should assume that file_function is a black-box function; specifically, I can't just tweak it to accept a generator (but maybe I can tweak a generator to be file-like?).
Is there a standard way to do this kind of lazy (skim-)reading of a generic file?

Note: the motivating example for this question is this pandas question, where having only a readline method is not enough to get pd.read_csv working...
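
For what it's worth, a filtering generator can be dressed up as a file-like object by subclassing io.RawIOBase and implementing readinto, which gives read/readline for free via a buffered wrapper. This is a rough, untested sketch; the class name and the pandas usage at the end are illustrative, not part of the original question:

import io

class FilteredFile(io.RawIOBase):
    """Expose only the lines of `path` that satisfy `condition`
    through a read()-able, file-like interface."""
    def __init__(self, path, condition):
        self._lines = (line for line in open(path, 'rb') if condition(line))
        self._leftover = b''

    def readable(self):
        return True

    def readinto(self, buf):
        try:
            chunk = self._leftover or next(self._lines)
        except StopIteration:
            return 0                       # no more matching lines: EOF
        n = min(len(buf), len(chunk))
        buf[:n] = chunk[:n]
        self._leftover = chunk[n:]         # keep the rest for the next call
        return n

# e.g. something along these lines might satisfy pd.read_csv:
# import pandas as pd
# f = io.BufferedReader(FilteredFile('file_name', lambda line: line != b'ooops\n'))
# df = pd.read_csv(f)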

Asked Feb 26 '13 by Andy Hayden

1 Answer

Use a map-reduce approach with existing Python facilities. In this example I'm using a regular expression to match lines that contain the string GET /index, but you can use whatever condition fits your bill:

import re
from collections import defaultdict

pattern = re.compile(r'GET /index(.*)\.html')

# define FILE appropriately.
# map:
# the 'GET' in line check cheaply skips lines that cannot match the pattern.
matches = (pattern.search(line) for line in open(FILE) if 'GET' in line)
mapp    = (match.group(1) for match in matches if match)

# now reduce, lazy:
count = defaultdict(int)
for request in mapp:
    count[request] += 1

This scans a >6 GB file in a few seconds on my laptop. You can further split the file into chunks and feed them to threads or processes, as sketched below. I don't recommend mmap unless you have the memory to map the entire file (it doesn't support windowing).
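
Here is a rough sketch of the "feed chunks to processes" idea: lazily slice the file into batches of lines and count matches in a multiprocessing pool. It assumes the same pattern as above; the file name, chunk size, and pool size are placeholders, not part of the original answer:

import re
from collections import Counter
from itertools import islice
from multiprocessing import Pool

PATTERN = r'GET /index(.*)\.html'

def count_chunk(lines):
    """Count pattern matches in one batch of lines (runs in a worker process)."""
    pattern = re.compile(PATTERN)
    counts = Counter()
    for line in lines:
        if 'GET' in line:                  # cheap pre-filter, as above
            m = pattern.search(line)
            if m:
                counts[m.group(1)] += 1
    return counts

def chunks(f, size=100000):
    """Lazily yield lists of up to `size` lines from an open file."""
    while True:
        block = list(islice(f, size))
        if not block:
            return
        yield block

if __name__ == '__main__':
    total = Counter()
    with open('access.log') as f, Pool() as pool:   # 'access.log' is a placeholder
        for partial in pool.imap_unordered(count_chunk, chunks(f)):
            total += partial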

Answered Nov 03 '22 by Michael Foukarakis