
Python: Writing to a single file with queue while using multiprocessing Pool

I have hundreds of thousands of text files that I want to parse in various ways. I want to save the output to a single file without synchronization problems. I have been using a multiprocessing Pool to save time, but I can't figure out how to combine Pool and Queue.

The following code saves the input file name as well as the maximum number of consecutive "x"s in the file. However, I want all processes to save results to the same file, not to separate files as in my example. Any help on this would be greatly appreciated.

import multiprocessing
import re  # needed by mp_worker; missing from the original snippet

with open('infilenamess.txt') as f:
    filenames = f.read().splitlines()

def mp_worker(filename):
    with open(filename, 'r') as f:
        text = f.read()
        m = re.findall("x+", text)
        count = len(max(m, key=len))
        # each worker appends to its own per-file output -- this is
        # the part I want replaced with a single shared file
        outfile = open(filename + '_results.txt', 'a')
        outfile.write(str(filename) + '|' + str(count) + '\n')
        outfile.close()

def mp_handler():
    p = multiprocessing.Pool(32)
    p.map(mp_worker, filenames)

if __name__ == '__main__':
    mp_handler()
asked Oct 27 '14 by risraelsen

People also ask

How does a multiprocessing Queue work in Python?

A queue is a data structure to which items are added by calling put() and from which items are retrieved by calling get(). multiprocessing.Queue provides a first-in, first-out (FIFO) queue, which means that items are retrieved from the queue in the order they were added.
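For example, here is a minimal sketch of put() and get() across two processes (producer is an illustrative name, not part of any library):

import multiprocessing

def producer(q):
    for i in range(3):
        q.put(i)  # items are added to the back of the queue

if __name__ == '__main__':
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=producer, args=(q,))
    p.start()
    for _ in range(3):
        print(q.get())  # retrieved in FIFO order: 0, 1, 2
    p.join()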

Is a Queue process-safe in Python?

multiprocessing.Queue is both thread-safe and process-safe: multiple processes can put and get items concurrently without corrupting the queue.
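Because of that safety, a common pattern is to have many worker processes put results on one queue while a single writer process drains it to a file. A minimal sketch, assuming illustrative worker/writer functions and made-up filenames:

import multiprocessing

def worker(filename, q):
    # stand-in for real parsing work; just report the filename
    q.put(filename + '|done\n')  # put() is safe with many concurrent writers

def writer(q):
    with open('results.txt', 'w') as f:
        while True:
            line = q.get()
            if line is None:  # sentinel: all workers are finished
                break
            f.write(line)

if __name__ == '__main__':
    q = multiprocessing.Queue()
    w = multiprocessing.Process(target=writer, args=(q,))
    w.start()
    workers = [multiprocessing.Process(target=worker, args=(name, q))
               for name in ('a.txt', 'b.txt')]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
    q.put(None)  # tell the writer to stop
    w.join()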

Does Python multiprocessing use multiple cores?

Yes. Python is not a single-threaded language, but threads within one Python process typically run one at a time because of the GIL; multiprocessing sidesteps this by giving each worker its own process and interpreter. Separately, even within a single process, libraries that perform computationally heavy tasks like numpy, scipy, and pytorch use C-based implementations under the hood, allowing the use of multiple cores.
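A quick way to see this on your own machine (a sketch; the exact output varies) is to ask each pool worker for its process ID:

import multiprocessing
import os

def which_process(_):
    # each pool worker is a separate OS process with its own GIL
    return os.getpid()

if __name__ == '__main__':
    with multiprocessing.Pool(4) as p:
        pids = set(p.map(which_process, range(100)))
    print(pids)  # typically several distinct process IDs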

What is a daemonic process in Python?

The multiprocessing module supports daemon processes through the daemon flag. Like daemon threads, daemon processes run in the background and are terminated automatically when the main process exits. To run a process in the background, set its daemon attribute to True before starting it.
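For example, a minimal sketch (heartbeat is just an illustrative name):

import multiprocessing
import time

def heartbeat():
    while True:  # would loop forever if not daemonized
        time.sleep(1)

if __name__ == '__main__':
    p = multiprocessing.Process(target=heartbeat, daemon=True)
    p.start()
    time.sleep(2)
    # the main process exits here and the daemon is terminated with it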


1 Answer

Multiprocessing pools implement a queue for you: use a pool method that returns the worker's return value to the caller, and do all the writing from the parent process. imap works well here:

import multiprocessing
import re

def mp_worker(filename):
    with open(filename) as f:
        text = f.read()
    m = re.findall("x+", text)
    count = len(max(m, key=len))
    return filename, count

def mp_handler():
    p = multiprocessing.Pool(32)
    with open('infilenamess.txt') as f:
        filenames = [line for line in (l.strip() for l in f) if line]
    with open('results.txt', 'w') as f:
        for result in p.imap(mp_worker, filenames):
            # (filename, count) tuples from mp_worker
            f.write('%s: %d\n' % result)

if __name__ == '__main__':
    mp_handler()
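If the order of lines in results.txt doesn't matter, one variation is to swap imap for imap_unordered, which hands back results as soon as any worker finishes. Only the loop inside mp_handler() changes; chunksize=64 is an illustrative value, not something from the original answer:

# drop-in replacement for the imap loop in mp_handler() above
for result in p.imap_unordered(mp_worker, filenames, chunksize=64):
    f.write('%s: %d\n' % result)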
answered Sep 21 '22 by tdelaney