I need to read some very huge text files (100+ Mb), process every lines with regex and store the data into a structure. My structure inherits from defaultdict, it has a read(self) method that read self.file_name file.
Look at this very simple (but not real) example, I'm not using regex, but I'm splitting lines:
import multiprocessing
from collections import defaultdict
def SingleContainer():
return list()
class Container(defaultdict):
"""
this class store odd line in self["odd"] and even line in self["even"].
It is stupid, but it's only an example. In the real case the class
has additional methods that do computation on readen data.
"""
def __init__(self,file_name):
if type(file_name) != str:
raise AttributeError, "%s is not a string" % file_name
defaultdict.__init__(self,SingleContainer)
self.file_name = file_name
self.readen_lines = 0
def read(self):
f = open(self.file_name)
print "start reading file %s" % self.file_name
for line in f:
self.readen_lines += 1
values = line.split()
key = {0: "even", 1: "odd"}[self.readen_lines %2]
self[key].append(values)
print "readen %d lines from file %s" % (self.readen_lines, self.file_name)
def do(file_name):
container = Container(file_name)
container.read()
return container.items()
if __name__ == "__main__":
file_names = ["r1_200909.log", "r1_200910.log"]
pool = multiprocessing.Pool(len(file_names))
result = pool.map(do,file_names)
pool.close()
pool.join()
print "Finish"
At the end I need to join every results in a single Container. It is important that the order of the lines is preserved. My approach is too slow when returning values. Better solution? I'm using python 2.6 on Linux
Use the glob function in the python library glob to find all the files you want to analyze. You can have multiple for loops nested inside each other. Python can only print strings to files. Don't forget to close files so python will actually write them.
Multiprocessing in Python enables the computer to utilize multiple cores of a CPU to run tasks/processes in parallel.
You're probably hitting two problems.
One of them was mentioned: you're reading multiple files at once. Those reads will end up being interleaved, causing disk thrashing. You want to read whole files at once, and then only multithread the computation on the data.
Second, you're hitting the overhead of Python's multiprocessing module. It's not actually using threads, but instead starting multiple processes and serializing the results through a pipe. That's very slow for bulk data--in fact, it seems to be slower than the work you're doing in the thread (at least in the example). This is the real-world problem caused by the GIL.
If I modify do() to return None instead of container.items() to disable the extra data copy, this example is faster than a single thread, as long as the files are already cached:
Two threads: 0.36elapsed 168%CPU
One thread (replace pool.map with map): 0:00.52elapsed 98%CPU
Unfortunately, the GIL problem is fundamental and can't be worked around from inside Python.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With