Read txt files with multiple threads in Python

I'm trying to read a file in Python (scan its lines and look for terms) and write the results, let's say counters for each term. I need to do that for a large number of files (more than 3000). Is it possible to do that with multiple threads? If so, how?

So, the scenario is like this:

Read each file and scan its lines
Write counters to same output file for all the files I've read.

My second question: does it improve the speed of reading and writing?

Hope it is clear enough. Thanks,

Ron.

asked Oct 15 '11 06:10

Ron D.

2 Answers

I agree with @aix: multiprocessing is definitely the way to go. Regardless, you will be I/O bound: you can only read from disk so fast, no matter how many parallel processes you have running. Still, some speedup is easy to get.

Consider the following (input/ is a directory that contains several .txt files from Project Gutenberg).

    import os
    import time
    from multiprocessing import Pool


    def process_file(name):
        '''Process one file: count number of lines and words.'''
        linecount = 0
        wordcount = 0
        with open(name, 'r') as inp:
            for line in inp:
                linecount += 1
                wordcount += len(line.split(' '))
        return name, linecount, wordcount


    def process_files_parallel(dirname, names):
        '''Process each file in parallel via Pool.map().'''
        paths = [os.path.join(dirname, name) for name in names]
        with Pool() as pool:
            return pool.map(process_file, paths)


    def process_files(dirname, names):
        '''Process each file serially via map().'''
        paths = [os.path.join(dirname, name) for name in names]
        # list() forces evaluation; map() is lazy in Python 3.
        return list(map(process_file, paths))


    if __name__ == '__main__':
        start = time.time()
        for dirname, _, names in os.walk('input/'):
            process_files(dirname, names)
        print("process_files()", time.time() - start)

        start = time.time()
        for dirname, _, names in os.walk('input/'):
            process_files_parallel(dirname, names)
        print("process_files_parallel()", time.time() - start)

When I run this on my dual-core machine there is a noticeable (but not 2x) speedup:

    $ python process_files.py
    process_files() 1.71218085289
    process_files_parallel() 1.28905105591

If the files are small enough to fit in memory, and you have lots of processing to do that isn't I/O bound, then you should see an even bigger improvement.
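For the scenario in the question (count terms in each file, then write all counters to one output file), a minimal sketch might look like the following. It is only an illustration under stated assumptions: the term list `TERMS`, the `input/` directory of `.txt` files, case-insensitive substring matching, and the CSV output named `counts.csv` are all hypothetical and not taken from the question. The worker processes only return their counters, and the parent process alone writes the output file, so no two processes ever write to it at the same time.

```python
import csv
import glob
import os
from collections import Counter
from functools import partial
from multiprocessing import Pool

# Hypothetical search terms; replace with the terms you actually need.
TERMS = ("python", "thread", "process")


def count_terms(path, terms=TERMS):
    """Return (path, Counter) with case-insensitive substring counts per term."""
    counts = Counter()
    with open(path, "r", encoding="utf-8", errors="replace") as inp:
        for line in inp:
            lowered = line.lower()
            for term in terms:
                counts[term] += lowered.count(term)
    return path, counts


def count_all(paths, terms=TERMS):
    """Count terms in every file, one worker process per CPU core by default."""
    with Pool() as pool:
        return pool.map(partial(count_terms, terms=terms), paths)


def write_counts(results, out_path, terms=TERMS):
    """Write one CSV row per file; only the parent process calls this."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["file", *terms])
        for path, counts in results:
            writer.writerow([path, *(counts[term] for term in terms)])


if __name__ == "__main__":
    # Assumed layout: the files to scan live in ./input/ and end in .txt.
    paths = sorted(glob.glob(os.path.join("input", "*.txt")))
    if paths:
        write_counts(count_all(paths), "counts.csv")
```

Collecting the results in the parent keeps the output file consistent without any locking, at the cost of holding one small counter per file in memory.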

answered Sep 25 '22 09:09

Austin Marshall

Yes, it should be possible to do this in a parallel manner.

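However, in Python it's hard to achieve parallelism with multiple threads: in CPython, the global interpreter lock (GIL) lets only one thread execute Python bytecode at a time. For this reason multiprocessing is the better default choice for doing things in parallel.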
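That said, threads are not useless here. CPython releases its interpreter lock while a thread is blocked on file I/O, so threads can overlap the waiting even though the scanning itself does not run in parallel. As a rough sketch of what a threaded version could look like, using the standard `concurrent.futures` module (the function names and the `max_workers=8` default are hypothetical choices for illustration):

```python
from concurrent.futures import ThreadPoolExecutor


def count_lines(path):
    """Return (path, number of lines) for one file."""
    with open(path, "rb") as inp:
        return path, sum(1 for _ in inp)


def count_lines_threaded(paths, max_workers=8):
    """Scan files with a thread pool; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(count_lines, paths))
```

`ProcessPoolExecutor` in the same module has the same interface, so switching between threads and processes to compare timings is a small change.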

It is hard to say what kind of speedup you can expect to achieve. It depends on what fraction of the workload it will be possible to do in parallel (the more the better), and what fraction will have to be done serially (the less the better).
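This relationship is usually stated as Amdahl's law. The sketch below is only an idealized estimate (it ignores process start-up cost and disk contention), and the function name is my own. For reference, the dual-core timings reported in the other answer (about 1.71 s versus 1.29 s, roughly a 1.33x speedup) are about what this formula gives when half of the work is parallelizable.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Idealized speedup when parallel_fraction of the work is spread over workers."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Compare how the parallel fraction limits the benefit of extra workers.
    for fraction in (0.5, 0.9, 0.99):
        print(fraction, amdahl_speedup(fraction, 2), amdahl_speedup(fraction, 8))
```

The serial fraction sets a hard ceiling: with half of the work serial, no number of workers can give more than a 2x speedup.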

answered Sep 24 '22 09:09

NPE
