I am trying to run boilerpipe with Python multiprocessing to parse RSS feeds from multiple sources. The problem is that it hangs in one of the workers after processing some links. The whole flow works if I remove the pool and run it in a loop.

Here is my multiprocessing code:
from multiprocessing import Pool

proc_pool = Pool(processes=4)
for each_link in data:
    proc_pool.apply_async(process_link_for_feeds, args=(each_link,), callback=store_results_to_db)
proc_pool.close()
proc_pool.join()
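As a side note on narrowing down where it hangs: instead of relying only on the callback, you can keep the AsyncResult handles and call get() with a timeout, so a stuck link raises TimeoutError rather than blocking join() forever. A hedged, self-contained sketch (fetch and the link list below are placeholders, not the real process_link_for_feeds; it uses the thread-backed pool so the example runs standalone):

```python
import multiprocessing
from multiprocessing.pool import ThreadPool  # same API as Pool, backed by threads

def fetch(link):
    # placeholder for process_link_for_feeds
    return link.upper()

pool = ThreadPool(processes=4)
pending = [(link, pool.apply_async(fetch, args=(link,)))
           for link in ["feed-a", "feed-b", "feed-c"]]
pool.close()

results = {}
for link, res in pending:
    try:
        # raises multiprocessing.TimeoutError if this link is stuck
        results[link] = res.get(timeout=30)
    except multiprocessing.TimeoutError:
        results[link] = None

pool.join()
```

With a real hang, the offending URL shows up as the entry that times out instead of the whole pool silently stalling.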
This is my boilerpipe code, which is called inside process_link_for_feeds():
def parse_using_bp(in_url):
    extracted_html = ""
    if ContentParser.url_skip_p.match(in_url):
        return extracted_html
    try:
        extractor = Extractor(extractor='ArticleExtractor', url=in_url)
        extracted_html = extractor.getHTML()
        del extractor
    except BaseException as e:
        print "Something's wrong at Boilerpipe -->", in_url, "-->", e
        extracted_html = ""
    finally:
        return extracted_html
I am clueless about why it is hanging. Is there something wrong in the proc_pool code?
Can you try threading instead? Multiprocessing is mainly for when you are CPU-bound. Also, boilerpipe already includes locking for the threaded case, which suggests it may need similar protection under multiprocessing as well.

If you really need multiprocessing, I will try to figure out how to patch boilerpipe.
Here is what I expect will be a drop-in replacement using threading. It uses multiprocessing.pool.ThreadPool, a "fake" multiprocessing pool backed by threads. The only change is from Pool(...) to ThreadPool(...). The one thing I'm not sure about is whether boilerpipe's multithreading check will detect the thread pool as having activeCount() > 1.
import multiprocessing
from multiprocessing.pool import ThreadPool  # undocumented ThreadPool class

# ...

proc_pool = ThreadPool(processes=4)  # this is the only change
for each_link in data:
    proc_pool.apply_async(process_link_for_feeds, args=(each_link,), callback=store_results_to_db)
proc_pool.close()
proc_pool.join()
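For what it's worth, the activeCount() concern can be checked directly: inside a ThreadPool worker there is always more than one live thread (the main thread plus the pool's worker and handler threads), so threading.active_count() should come back greater than 1. A quick self-contained sketch (worker is a placeholder, not boilerpipe code):

```python
import threading
from multiprocessing.pool import ThreadPool

def worker(x):
    # report the live-thread count as seen from inside the pool
    return x * 2, threading.active_count()

pool = ThreadPool(processes=4)
async_results = [pool.apply_async(worker, args=(i,)) for i in range(8)]
pool.close()
pool.join()

doubled = [r.get()[0] for r in async_results]
counts = [r.get()[1] for r in async_results]
```

Every count observed from inside a worker should be > 1, which is the condition boilerpipe's threading guard keys on.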