
Processing single file from multiple processes

I have a single big text file in which I want to process each line (do some operations) and store the results in a database. Since a single simple program is taking too long, I want this to be done via multiple processes or threads. Each thread/process should read different data (different lines) from that single file, do some operations on its piece of data (lines), and put the results in the database, so that in the end I have all of the data processed and my database is populated with the data I need.

But I am not able to figure out how to approach this.

asked Jun 25 '12 by pranavk


People also ask

Can a file be opened by multiple processes?

If you open a file, other processes can still write to it unless you use a lock. Advisory locks are not useful unless the writing application also agrees to use them.

Can multiple processes read the same file?

Sure they can; ultimately it is the role of the OS to ensure that each process or thread reads at its own pace, so you need not worry about it.

Can multiple processes append to the same file?

Two processes successfully appending to the same file will result in all their bytes in the file in order, but not necessarily contiguously. The caveat is that not all filesystems are POSIX-compatible. Two famous examples are NFS and the Hadoop Distributed File System (HDFS).

Can 2 processes write to the same file?

No, generally it is not safe to do this. You need to obtain an exclusive write lock for each process, which implies that all the other processes will have to wait while one process is writing to the file. The more I/O-intensive processes you have, the longer the wait time.
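
For instance, on POSIX systems an exclusive advisory lock can be taken with fcntl.flock. This is only a rough sketch (the shared.log filename is made up, and the fcntl module is Unix-only):

import fcntl

with open("shared.log", "a") as f:
    # blocks until no other process holds a lock on this file
    fcntl.flock(f, fcntl.LOCK_EX)
    f.write("one whole record\n")
    f.flush()
    # release so other writers can proceed
    fcntl.flock(f, fcntl.LOCK_UN)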


1 Answer

What you are looking for is a Producer/Consumer pattern

Basic threading example

Here is a basic example using the threading module (instead of multiprocessing)

import threading
import queue
import sys

def do_work(in_queue, out_queue):
    while True:
        item = in_queue.get()
        # process the item (placeholder: pass it through unchanged)
        result = item
        out_queue.put(result)
        in_queue.task_done()

if __name__ == "__main__":
    work = queue.Queue()
    results = queue.Queue()
    total = 20

    # start the worker threads
    for i in range(4):
        t = threading.Thread(target=do_work, args=(work, results))
        t.daemon = True
        t.start()

    # produce data
    for i in range(total):
        work.put(i)

    work.join()

    # get the results
    for i in range(total):
        print(results.get())

    sys.exit()

You wouldn't share the file object with the threads. You would produce work for them by supplying the queue with lines of data. Then each thread would pick up a line, process it, and put the result on the results queue.
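
As a sketch of that idea (not part of the original answer, and assuming a hypothetical input.txt), the "produce data" and "get the results" loops above could be replaced with something like:

# produce data: one queue item per line of the file
with open("input.txt") as f:
    n_lines = 0
    for line in f:
        work.put(line.rstrip("\n"))
        n_lines += 1

work.join()

# get the results (in whatever order the threads finished)
for _ in range(n_lines):
    print(results.get())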

There are more advanced facilities built into the multiprocessing module for sharing data, like lists and a special kind of Queue. There are trade-offs to using multiprocessing vs. threads, and which is better depends on whether your work is CPU-bound or IO-bound.
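
For instance, a plain multiprocessing.Queue can be passed to child processes much like the thread-safe queue above. This is a minimal sketch, not part of the original answer and not tied to the file-processing task:

from multiprocessing import Process, Queue

def worker(in_q, out_q):
    # pull items until the None sentinel arrives, then stop
    for item in iter(in_q.get, None):
        out_q.put(item * 2)

if __name__ == "__main__":
    in_q, out_q = Queue(), Queue()
    p = Process(target=worker, args=(in_q, out_q))
    p.start()

    for i in range(5):
        in_q.put(i)
    in_q.put(None)          # exit signal for the worker

    # drain the results before joining the worker process
    for _ in range(5):
        print(out_q.get())
    p.join()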

Basic multiprocessing.Pool example

Here is a really basic example of a multiprocessing Pool

from multiprocessing import Pool

def process_line(line):
    return "FOO: %s" % line

if __name__ == "__main__":
    pool = Pool(4)
    with open('file.txt') as source_file:
        # chunk the work into batches of 4 lines at a time
        results = pool.map(process_line, source_file, 4)

    print(results)

A Pool is a convenience object that manages its own processes. Since an open file can be iterated over line by line, you can pass it to pool.map(), which will loop over it and deliver lines to the worker function. map() blocks and returns the entire result when it's done. Be aware that this is an overly simplified example, and that pool.map() is going to read your entire file into memory all at once before dishing out work. If you expect to have large files, keep this in mind. There are more advanced ways to design a producer/consumer setup.
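
One such option is Pool.imap, which yields results one at a time as workers finish instead of building the whole result list the way map() does. This is only a sketch under the same assumptions as above (a hypothetical file.txt and the same toy process_line):

from multiprocessing import Pool

def process_line(line):
    # same toy worker as above
    return "FOO: %s" % line.rstrip("\n")

if __name__ == "__main__":
    with Pool(4) as pool, open("file.txt") as source_file:
        # imap hands back each result as soon as it is ready,
        # so you never hold the full result list in memory
        for result in pool.imap(process_line, source_file, 100):
            print(result)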

Manual "pool" with limit and line re-sorting

This is a manual version of the Pool.map example, but instead of consuming an entire iterable in one go, it sets a queue size so that work is only fed in piece by piece, as fast as it can be processed. I also added line numbers so that you can track them and refer to them later if you want.

from multiprocessing import Process, Manager
import time
import itertools

def do_work(in_queue, out_list):
    while True:
        item = in_queue.get()
        line_no, line = item

        # exit signal
        if line is None:
            return

        # fake work
        time.sleep(.5)
        result = (line_no, line)

        out_list.append(result)

if __name__ == "__main__":
    num_workers = 4

    manager = Manager()
    results = manager.list()
    # bounded queue so the producer can't run far ahead of the workers
    work = manager.Queue(num_workers)

    # start the workers
    pool = []
    for i in range(num_workers):
        p = Process(target=do_work, args=(work, results))
        p.start()
        pool.append(p)

    # produce data, appending one None per worker as an exit signal
    with open("source.txt") as f:
        iters = itertools.chain(f, (None,) * num_workers)
        for num_and_line in enumerate(iters):
            work.put(num_and_line)

    for p in pool:
        p.join()

    # get the results
    # example:  [(1, "foo"), (10, "bar"), (0, "start")]
    print(sorted(results))
answered Sep 22 '22 by jdi