read multiple files using multiprocessing


I need to read some very large text files (100+ MB), process every line with a regex, and store the data in a structure. My structure inherits from defaultdict and has a read(self) method that reads the file named by self.file_name.

Look at this very simple (but not real) example; I'm not using a regex here, just splitting lines:


import multiprocessing
from collections import defaultdict

def SingleContainer():
    return list()

class Container(defaultdict):
    """
    This class stores odd lines in self["odd"] and even lines in
    self["even"]. It is contrived, but it's only an example: in the real
    case the class has additional methods that do computation on the
    data it has read.
    """
    def __init__(self, file_name):
        if type(file_name) != str:
            raise AttributeError("%s is not a string" % file_name)
        defaultdict.__init__(self, SingleContainer)
        self.file_name = file_name
        self.lines_read = 0
    def read(self):
        f = open(self.file_name)
        print "start reading file %s" % self.file_name
        for line in f:
            self.lines_read += 1
            values = line.split()
            key = {0: "even", 1: "odd"}[self.lines_read % 2]
            self[key].append(values)
        f.close()
        print "read %d lines from file %s" % (self.lines_read, self.file_name)

def do(file_name):
    container = Container(file_name)
    container.read()
    return container.items()

if __name__ == "__main__":
    file_names = ["r1_200909.log", "r1_200910.log"]
    pool = multiprocessing.Pool(len(file_names))
    result = pool.map(do, file_names)
    pool.close()
    pool.join()
    print "Finish"      

At the end I need to join all the results into a single Container. It is important that the order of the lines is preserved. My approach is too slow when returning the values. Is there a better solution? I'm using Python 2.6 on Linux.

asked Jan 15 '10 by Ruggero Turra

1 Answer

You're probably hitting two problems.

One of them was already mentioned: you're reading multiple files at once. Those reads end up interleaved, causing disk thrashing. You want to read each file whole, and only then parallelize the computation on the data, as sketched below.
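
A minimal sketch of that idea (the process_lines helper and the (file_name, lines) job tuples are my own, not from the question or answer): read each file to the end in the parent process, then hand only the CPU-bound parsing to the pool. Note that this still pickles the lines and the results through pipes, which is the second problem described below.

import multiprocessing

def process_lines(job):
    """Hypothetical helper: bucket already-read lines into even/odd lists."""
    file_name, lines = job
    result = {"even": [], "odd": []}
    for n, line in enumerate(lines, 1):
        key = {0: "even", 1: "odd"}[n % 2]
        result[key].append(line.split())
    return file_name, result

if __name__ == "__main__":
    file_names = ["r1_200909.log", "r1_200910.log"]
    # Read each file sequentially so the disk sees one long contiguous
    # read at a time instead of interleaved seeks between files.
    jobs = []
    for name in file_names:
        f = open(name)
        jobs.append((name, f.readlines()))
        f.close()
    # Only the CPU-bound line processing happens in parallel.
    pool = multiprocessing.Pool(len(jobs))
    results = pool.map(process_lines, jobs)
    pool.close()
    pool.join()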

Second, you're hitting the overhead of Python's multiprocessing module. It doesn't actually use threads; it starts separate processes and serializes the results back through a pipe. That's very slow for bulk data; in fact, it seems to be slower than the work being done in the worker (at least in this example). This is a real-world consequence of the GIL: since CPython threads can't run CPU-bound work in parallel, you end up paying multiprocessing's serialization cost instead.
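
To get a rough feel for that serialization cost, here is a hypothetical benchmark (the data shape and sizes are made up): it pickles a result comparable to what do() returns, which is roughly what multiprocessing has to do before pushing a return value through the pipe.

import time
import cPickle  # multiprocessing serializes results with (c)Pickle

# Hypothetical result roughly shaped like container.items() for a big
# file: a million short lists of split fields.
data = [("even", [["field1", "field2"] for _ in xrange(500000)]),
        ("odd",  [["field1", "field2"] for _ in xrange(500000)])]
start = time.time()
blob = cPickle.dumps(data, cPickle.HIGHEST_PROTOCOL)
print "pickled %d bytes in %.2f seconds" % (len(blob), time.time() - start)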

If I modify do() to return None instead of container.items(), to avoid that extra data copy, this example is faster than a single process, as long as the files are already cached:

Two worker processes: 0.36 elapsed, 168% CPU

One process (replace pool.map with map): 0:00.52 elapsed, 98% CPU
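
For concreteness, the change described above is just the following; returning None means the parsed data stays in the worker process and is never pickled back to the parent.

def do(file_name):
    container = Container(file_name)
    container.read()
    # Return None instead of container.items(): the bulk data is not
    # serialized back through the pipe, so the parallel version wins.
    return None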

Unfortunately, the GIL problem is fundamental and can't be worked around from inside Python.

answered Nov 25 '22 by Glenn Maynard