
python-boilerpipe hangs with multiprocessing

I am trying to run boilerpipe with Python multiprocessing in order to parse RSS feeds from multiple sources. The problem is that it hangs in one of the worker processes after processing some links. The whole flow works if I remove the pool and run it in a plain loop.

Here is my multiprocessing code:

from multiprocessing import Pool

proc_pool = Pool(processes=4)
for each_link in data:
    proc_pool.apply_async(process_link_for_feeds, args=(each_link, ), callback=store_results_to_db)
proc_pool.close()
proc_pool.join()

This is my boilerpipe code which is being called inside process_link_for_feeds():

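from boilerpipe.extract import Extractor  # python-boilerpipe

def parse_using_bp(in_url):
    # Returns the extracted article HTML, or "" if the URL is skipped or extraction fails.
    extracted_html = ""
    # ContentParser.url_skip_p is defined elsewhere in my code
    if ContentParser.url_skip_p.match(in_url):
        return extracted_html
    try:
        extractor = Extractor(extractor='ArticleExtractor', url=in_url)
        extracted_html = extractor.getHTML()
        del extractor
    except Exception as e:
        # Catch Exception rather than BaseException so KeyboardInterrupt/SystemExit still propagate
        print("Something's wrong at Boilerpipe --> %s --> %s" % (in_url, e))
        extracted_html = ""
    return extracted_html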

I have no idea why it is hanging. Is there something wrong in the proc_pool code?

asked Dec 06 '13 09:12 by dpatro


1 Answer

Can you try threading instead? Multiprocessing mainly pays off when you are CPU bound, while fetching feeds is mostly waiting on the network. Also, boilerpipe already includes protection for use with threading, which suggests that it may need similar protection under multiprocessing.

If you really need multiprocessing, I will try to figure out how to patch boilerpipe.

Here is what I think will be a drop-in replacement using threading. It uses multiprocessing.pool.ThreadPool, which has the same interface as Pool but runs its workers as threads inside a single process. The only change is from Pool(...) to ThreadPool(...). The one thing I'm not sure about is whether boilerpipe's multithreading check will see the thread pool's workers, that is, whether activeCount() > 1 holds when the check runs.

from multiprocessing.pool import ThreadPool  # thread-backed pool with the same API as Pool

# ...
proc_pool = ThreadPool(processes=4)  # this is the only difference
for each_link in data:
    proc_pool.apply_async(process_link_for_feeds, args=(each_link, ), callback=store_results_to_db)
proc_pool.close()
proc_pool.join()
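
The activeCount() doubt can be checked from inside a worker without involving boilerpipe at all. Below is a minimal sketch using only the standard library. count_threads is a hypothetical helper introduced just for this check, and threading.active_count() is the newer spelling of activeCount(). It only shows what the threading module reports inside a ThreadPool worker. It does not prove that boilerpipe's own check then behaves correctly.

```python
import threading
from multiprocessing.pool import ThreadPool


def count_threads(_):
    # Hypothetical helper: runs inside a ThreadPool worker and
    # reports how many threads the threading module considers alive.
    return threading.active_count()


pool = ThreadPool(processes=4)
counts = pool.map(count_threads, range(8))
pool.close()
pool.join()

# True when every call saw more than one live thread.
workers_are_counted = all(n > 1 for n in counts)
print(workers_are_counted)
```

If this prints True, the pool's workers are visible to the threading module as ordinary threads, so a check of the form activeCount() > 1 made from a worker should succeed.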

answered Sep 22 '22 05:09 by KobeJohn