
Return list in the original order when using concurrent futures

I'm using concurrent.futures to speed up an IO-bound process (retrieving the H1 heading from a list of URLs found on the Wayback Machine). The code works, but it returns the list in an arbitrary order. I'm looking for a way to return the results in the same order as the original list of URLs.

archive_url_list = [
    'https://web.archive.org/web/20171220002410/http://www.manueldrivingschool.co.uk:80/areas-covered-for-driving-lessons',
    'https://web.archive.org/web/20210301102140/https://www.manueldrivingschool.co.uk/contact.php',
    'https://web.archive.org/web/20210301102140/https://www.manueldrivingschool.co.uk/contact.php',
    'https://web.archive.org/web/20171220002415/http://www.manueldrivingschool.co.uk:80/contact',
    'https://web.archive.org/web/20160520140505/http://www.manueldrivingschool.co.uk:80/about.php',
    'https://web.archive.org/web/20180102123922/http://www.manueldrivingschool.co.uk:80/about',
]

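import waybackpy
import concurrent.futures
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Assumed value: the original post does not show how CONNECTIONS is defined.
CONNECTIONS = 10

archive_h1_list = []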
def get_archive_h1(h1_url):
    html = urlopen(h1_url)
    bsh = BeautifulSoup(html.read(), 'lxml')
    return bsh.h1.text.strip()

def concurrent_calls():
    with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as executor:
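        f1 = (executor.submit(get_archive_h1, h1_url) for h1_url in archive_url_list)
        # as_completed yields futures as they finish, not in submission order,
        # which is why archive_h1_list ends up in an arbitrary order.
        for future in concurrent.futures.as_completed(f1):
            try:
                data = future.result()
                archive_h1_list.append(data)
            except Exception:
                archive_h1_list.append("No Data Received!")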

if __name__ == '__main__':
    concurrent_calls()
    print(archive_h1_list)

I've tried creating a second list and appending the original URL to it as the code runs, hoping to tie the results back to their URLs afterwards, but all I get is an empty list. I'm new to concurrent.futures and hoping there's a standard way to do this.

Lee Roy asked Nov 17 '25 05:11


1 Answer

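Instead of building a generator of ThreadPoolExecutor.submit calls and iterating over it with as_completed, use ThreadPoolExecutor.map, which returns results in the order of the input iterable: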

def concurrent_calls():
    with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as executor:
        f1 = executor.map(get_archive_h1, archive_url_list)
        ...

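executor.map returns the results in the same order as archive_url_list, regardless of which request finishes first, and it is more concise than submitting each call by hand. Note that an exception raised inside get_archive_h1 is re-raised when that result is retrieved from the iterator, so the error handling has to account for that.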
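
For illustration, here is a minimal sketch of two ways to keep the original order. It does not touch the network: fetch_heading is a hypothetical stand-in for get_archive_h1 that sleeps for a random time to simulate latency, and safe_fetch, ordered_with_map, ordered_with_as_completed and the sample inputs are names I made up for this example. The first function uses executor.map with the error handling moved inside the worker; the second keeps submit and as_completed but records each future's index so the result can be written back to the right slot.

```python
import concurrent.futures
import random
import time


def fetch_heading(url):
    # Hypothetical stand-in for get_archive_h1: sleep a random amount to
    # simulate network latency, then return a value derived from the input.
    time.sleep(random.uniform(0.01, 0.05))
    if "bad" in url:
        raise ValueError("simulated failure")
    return "H1 for " + url


def safe_fetch(url):
    # Handle errors inside the worker so that a single failure does not
    # interrupt iteration over the results of executor.map.
    try:
        return fetch_heading(url)
    except Exception:
        return "No Data Received!"


def ordered_with_map(urls, max_workers=4):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map yields results in the order of urls, not in completion order.
        return list(executor.map(safe_fetch, urls))


def ordered_with_as_completed(urls, max_workers=4):
    # Pre-size the output so each result can be stored at its input position.
    results = [None] * len(urls)
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember where each future's input sat in the original list.
        future_to_index = {
            executor.submit(fetch_heading, url): index
            for index, url in enumerate(urls)
        }
        for future in concurrent.futures.as_completed(future_to_index):
            index = future_to_index[future]
            try:
                results[index] = future.result()
            except Exception:
                results[index] = "No Data Received!"
    return results


if __name__ == "__main__":
    sample = ["page-a", "page-b", "bad-page", "page-c"]
    print(ordered_with_map(sample))
    print(ordered_with_as_completed(sample))
```

Both functions return one entry per input, in input order, with the placeholder string in the slot of any call that failed. The map version is shorter; the as_completed version may be preferable if you want to process each result as soon as it arrives while still knowing which URL it belongs to.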

U12-Forward answered Nov 19 '25 19:11