
ThreadPoolExecutor vs threading.Thread

I have a question about the performance of ThreadPoolExecutor versus the plain Thread class; I suspect I'm missing something fundamental.

I have a web scraper made of two functions. The first parses the image links from a website's homepage, and the second downloads the image behind each parsed link:

import threading
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup as bs
import os
from concurrent.futures import ThreadPoolExecutor

path = r'C:\Users\MyDocuments\Pythom\Networking\bbc_images_scraper_test'
url = 'https://www.bbc.co.uk'

# Function to parse link anchors for images
def img_links_parser(url, links_list):
    res = urllib.request.urlopen(url)
    soup = bs(res,'lxml')
    content = soup.findAll('div',{'class':'top-story__image'})

    for i in content:
        try:
            link = i.attrs['style']
            # Pulling the anchor from parentheses
            link = link[link.find('(')+1 : link.find(')')]
            # Putting the anchor in the list of links
            links_list.append(link)
        except KeyError:
            # Links might be under the 'data-lazy' attribute, without parentheses
            links_list.append(i.attrs['data-lazy'])

# Function to load images from links
def img_loader(base_url, links_list, path_location):
    for link in links_list:
        try:
            # Pulling the last element off the link, which is name.jpg
            file_name = link.split('/')[-1]
            # Following the link and saving the content in the given directory
            urllib.request.urlretrieve(
                urllib.parse.urljoin(base_url, link),
                os.path.join(path_location, file_name))
        except Exception:
            print('Error on {}'.format(urllib.parse.urljoin(base_url, link)))

The remaining code is split into two cases:

Case 1: I'm using multiple threads:

threads = []
t1 = threading.Thread(target = img_loader, args = (url, links[:10], path))
t2 = threading.Thread(target = img_loader, args = (url, links[10:20], path))
t3 = threading.Thread(target = img_loader, args = (url, links[20:30], path))
t4 = threading.Thread(target = img_loader, args = (url, links[30:40], path))
t5 = threading.Thread(target = img_loader, args = (url, links[40:50], path))
t6 = threading.Thread(target = img_loader, args = (url, links[50:], path))

threads.extend([t1,t2,t3,t4,t5,t6])
for t in threads:
    t.start()
for t in threads:
    t.join()

The above code takes 10 seconds on my machine.

Case 2: I'm using ThreadPoolExecutor:

with ThreadPoolExecutor(50) as exec:
    results = exec.submit(img_loader, url, links, path)

The above code takes 18 seconds.

My understanding was that ThreadPoolExecutor creates a thread for each worker. So setting max_workers to 50 should give 50 threads, and the job should therefore have finished faster.
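
That assumption can be checked without any network access. The sketch below is not part of the scraper: process_all is a hypothetical stand-in for img_loader, and the 20 dummy items and the short sleep are assumptions used only to simulate slow I/O. It submits one call to a pool of 50 and records which threads actually ran the work.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def process_all(items):
    """Hypothetical stand-in for img_loader: handles every item in one call."""
    used = set()
    for _ in items:
        time.sleep(0.01)  # simulated I/O wait, not a real download
        used.add(threading.current_thread().name)
    return used


items = list(range(20))  # dummy work items (assumption)

with ThreadPoolExecutor(max_workers=50) as executor:
    # One submit() call is one task, whatever max_workers is set to.
    future = executor.submit(process_all, items)
    threads_used = future.result()

print('Threads that did the work:', len(threads_used))
```

Because the whole list travels in a single task, only one thread name should be recorded. max_workers is an upper limit on how many threads the pool may create, not a count of how many will be used.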

Can someone please explain what I'm missing here? I assume I'm making a silly mistake, but I just don't get it.

Many thanks!

asked Dec 27 '17 at 16:12 by Vlad



1 Answer

In Case 2 you're submitting all the links as a single task, so only one worker thread handles them, however large max_workers is. Instead of

exec.submit(img_loader, url, links, path)

you'd need to submit one task per link, inside the with block:

for link in links:
    exec.submit(img_loader, url, [link], path)

I didn't try it out myself; that's just from reading the documentation of ThreadPoolExecutor.
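
For anyone who wants to try it without hitting a real site, here is a minimal sketch that compares the two submission styles. It makes several assumptions that are not in the question: fake_download is a hypothetical function that sleeps instead of downloading, the 20 link names are made up, and the pool size of 10 is arbitrary.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_download(link):
    """Hypothetical stand-in for one image download (sleeps instead)."""
    time.sleep(0.05)
    return link, threading.current_thread().name


links = ['img_{}.jpg'.format(i) for i in range(20)]  # made-up link names

# One task per link: the pool can spread the tasks over its workers.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(fake_download, link) for link in links]
    # result() re-raises any exception that happened inside the task.
    results = [f.result() for f in as_completed(futures)]
elapsed_pool = time.perf_counter() - start

# One task for everything: a single worker processes the links in turn.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as executor:
    single = executor.submit(lambda: [fake_download(l) for l in links])
    single.result()
elapsed_single = time.perf_counter() - start

worker_names = {name for _, name in results}
print('per-link tasks: {:.2f}s on {} threads'.format(
    elapsed_pool, len(worker_names)))
print('single task:    {:.2f}s'.format(elapsed_single))
```

The per-link version should finish noticeably faster, since the waits overlap across workers. Calling result() on each future also matters in the real scraper: an exception raised inside a submitted task stays hidden until the future's result is requested.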

answered Sep 18 '22 at 03:09 by hansaplast