ThreadPoolExecutor vs threading.Thread

I have a question about the performance of ThreadPoolExecutor versus using the Thread class directly. I suspect I'm missing something fundamental.

I have a web scraper made up of two functions. The first parses the link of each image on a website's home page, and the second downloads the images from the parsed links:

import os
import threading
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

from bs4 import BeautifulSoup as bs

path = r'C:\Users\MyDocuments\Pythom\Networking\bbc_images_scraper_test'
url = 'https://www.bbc.co.uk'

# Parse the image links out of the page
def img_links_parser(url, links_list):
    res = urllib.request.urlopen(url)
    soup = bs(res, 'lxml')
    content = soup.find_all('div', {'class': 'top-story__image'})

    for i in content:
        try:
            link = i.attrs['style']
            # Pull the link out of the parentheses in the style attribute
            link = link[link.find('(') + 1 : link.find(')')]
            # Add the link to the list of links
            links_list.append(link)
        except KeyError:
            # The link may be under the 'data-lazy' attribute, without parentheses
            links_list.append(i.attrs['data-lazy'])

# Download each image in links_list into path_location
def img_loader(base_url, links_list, path_location):
    for link in links_list:
        # Resolve relative links against the base URL
        full_url = urllib.parse.urljoin(base_url, link)
        try:
            # The last element of the link is the file name, e.g. name.jpg
            file_name = link.split('/')[-1]
            # Follow the link and save the content in the given directory
            urllib.request.urlretrieve(full_url, os.path.join(path_location, file_name))
        except Exception:
            print('Error on {}'.format(full_url))

The code that follows comes in two versions. In both, links is the list populated by img_links_parser.

Case 1: I'm using multiple threads:

threads = []
t1 = threading.Thread(target = img_loader, args = (url, links[:10], path))
t2 = threading.Thread(target = img_loader, args = (url, links[10:20], path))
t3 = threading.Thread(target = img_loader, args = (url, links[20:30], path))
t4 = threading.Thread(target = img_loader, args = (url, links[30:40], path))
t5 = threading.Thread(target = img_loader, args = (url, links[40:50], path))
t6 = threading.Thread(target = img_loader, args = (url, links[50:], path))

threads.extend([t1,t2,t3,t4,t5,t6])
for t in threads:
    t.start()
for t in threads:
    t.join()

On my machine, the above code finishes in 10 seconds.

Case 2: I'm using ThreadPoolExecutor

with ThreadPoolExecutor(50) as exec:
    results = exec.submit(img_loader, url, links, path)

The above code takes 18 seconds.

My understanding was that ThreadPoolExecutor creates a thread for each worker. So setting max_workers to 50 should give me 50 threads, and the job should therefore finish faster.

Can someone please explain what I'm missing here? I assume I'm making a silly mistake, but I just don't get it.

Many thanks!

asked Dec 27 '17 16:12

Vlad

1 Answer

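In Case 2 you're sending all the links to one worker: a single submit call schedules a single task, so one thread downloads every image in turn and the other workers are never used. Instead of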

exec.submit(img_loader, url, links, path)

you'd need to submit one task per link:

for link in links:
    exec.submit(img_loader, url, [link], path)

I didn't try it out myself; that's just from reading the documentation of ThreadPoolExecutor.
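For what it's worth, here is a minimal sketch of the difference that runs without touching the network. It is not the asker's scraper: fake_download is a hypothetical stand-in for img_loader, time.sleep simulates the network wait, and the link names are made up. Each task returns the name of the thread it ran on, so you can count how many pool threads actually did work.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_download(links):
    """Hypothetical stand-in for img_loader: 'downloads' each link in turn."""
    for _ in links:
        time.sleep(0.05)  # simulated network wait, no real request is made
    # Report which pool thread handled this task
    return threading.current_thread().name


def threads_used_single_submit(links, max_workers=50):
    """Submit the whole list as ONE task, as in Case 2."""
    with ThreadPoolExecutor(max_workers) as executor:
        future = executor.submit(fake_download, links)
        return {future.result()}


def threads_used_per_link(links, max_workers=50):
    """Submit one task per link so the pool can run them concurrently."""
    with ThreadPoolExecutor(max_workers) as executor:
        futures = [executor.submit(fake_download, [link]) for link in links]
        return {f.result() for f in as_completed(futures)}


# Made-up link names, only used to give the tasks something to iterate over
links = ['img{}.jpg'.format(n) for n in range(20)]

single = threads_used_single_submit(links)
per_link = threads_used_per_link(links)

print(len(single), 'thread(s) used with a single submit')
print(len(per_link), 'thread(s) used with one submit per link')
```

With a single submit, only one thread name ever comes back, however large max_workers is. With one submit per link, several names come back, because each task can be picked up by a different worker. In the real scraper, calling result() on each future would also surface any exception raised inside the task, although img_loader as written catches and prints its own errors.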

answered Sep 18 '22 03:09

hansaplast

Tags: python-3.x, multithreading, threadpoolexecutor
