I've got a question about the performance of ThreadPoolExecutor vs the Thread class on its own, which makes me think I lack some fundamental understanding.
I have a web scraper split into two functions. The first parses the link for each image on a website's homepage, and the second downloads the image from each parsed link:
import os
import threading
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

from bs4 import BeautifulSoup as bs

path = r'C:\Users\MyDocuments\Pythom\Networking\bbc_images_scraper_test'
url = 'https://www.bbc.co.uk'

# Function to parse link anchors for images
def img_links_parser(url, links_list):
    res = urllib.request.urlopen(url)
    soup = bs(res, 'lxml')
    content = soup.findAll('div', {'class': 'top-story__image'})
    for i in content:
        try:
            link = i.attrs['style']
            # Pulling the anchor from parentheses
            link = link[link.find('(')+1 : link.find(')')]
            # Putting the anchor in the list of links
            links_list.append(link)
        except Exception:
            # Links might be under the 'data-lazy' attribute w/o parentheses
            links_list.append(i.attrs['data-lazy'])

# Function to load images from links
def img_loader(base_url, links_list, path_location):
    for link in links_list:
        try:
            # Pulling the last element off the link, which is name.jpg
            file_name = link.split('/')[-1]
            # Following the link and saving the content in the given directory
            urllib.request.urlretrieve(urllib.parse.urljoin(base_url, link),
                                       os.path.join(path_location, file_name))
        except Exception:
            print('Error on {}'.format(urllib.parse.urljoin(base_url, link)))

# Build the list of links that both cases below use
links = []
img_links_parser(url, links)
The following code is split up into two cases:
Case 1: I'm using multiple threads:
threads = []
t1 = threading.Thread(target = img_loader, args = (url, links[:10], path))
t2 = threading.Thread(target = img_loader, args = (url, links[10:20], path))
t3 = threading.Thread(target = img_loader, args = (url, links[20:30], path))
t4 = threading.Thread(target = img_loader, args = (url, links[30:40], path))
t5 = threading.Thread(target = img_loader, args = (url, links[40:50], path))
t6 = threading.Thread(target = img_loader, args = (url, links[50:], path))
threads.extend([t1,t2,t3,t4,t5,t6])
for t in threads:
    t.start()
for t in threads:
    t.join()
The above code finishes in 10 seconds on my machine.
Case 2: I'm using ThreadPoolExecutor:
with ThreadPoolExecutor(50) as exec:
    results = exec.submit(img_loader, url, links, path)
The above code takes 18 seconds.
My understanding was that ThreadPoolExecutor creates a thread for each worker. So setting max_workers to 50 should result in 50 threads, and therefore the job should have completed faster.
Can someone please explain what I am missing here? I admit that I'm probably making a silly mistake, but I just don't get it.
Many thanks!
ThreadPoolExecutor Thread-Safety
Although the ThreadPoolExecutor uses threads internally, you do not need to work with threads directly in order to execute tasks and get results. Nevertheless, when accessing resources or critical sections, thread-safety may be a concern.
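As a rough illustration of that concern, here is a minimal sketch in which several pool workers update one shared value. The names `counter`, `counter_lock` and `record_download` are made up for this example and are not part of the scraper above; the lock is one common way to protect a critical section, not the only one.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

counter = 0                      # shared state touched by every task (hypothetical)
counter_lock = threading.Lock()  # guards the read-modify-write on counter

def record_download(_):
    """Pretend one download finished and count it."""
    global counter
    # Only one thread at a time may run this block
    with counter_lock:
        counter += 1

with ThreadPoolExecutor(max_workers=8) as pool:
    # list() drains the iterator so any exception from a task is re-raised here
    list(pool.map(record_download, range(1000)))

print(counter)
```

Submitting tasks to the pool needs no extra locking; it is only the shared data that the tasks themselves touch that has to be protected.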
For Java's ThreadPoolExecutor, submit is thread-safe. You can see this in the JDK 8 source code: when a new task is added, it uses a mainLock to ensure thread safety.
As their names suggest, the ThreadPoolExecutor uses threads internally, whereas the ProcessPoolExecutor uses processes. A process has a main thread and may have additional threads. A thread belongs to a process.
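The thread side of this can be checked with a small sketch: every task run by a ThreadPoolExecutor reports the same process ID, but the tasks are spread over several thread IDs. The function name `where_am_i` is invented for this example, and the short sleep is only there to keep the workers busy so that the pool starts more than one thread. ProcessPoolExecutor is not shown here.

```python
import os
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def where_am_i(_):
    """Return the process ID and thread ID the task is running on."""
    time.sleep(0.05)  # keep this worker busy so other workers get started
    return os.getpid(), threading.get_ident()

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(where_am_i, range(8)))

pids = {pid for pid, _ in results}
thread_ids = {tid for _, tid in results}

print('process IDs seen:', len(pids))
print('thread IDs seen:', len(thread_ids))
```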
In Java, ThreadPoolExecutor is an ExecutorService that executes each submitted task using one of possibly several pooled threads, normally configured using Executors factory methods. It also provides various utility methods to check current thread statistics and control them.
In Case 2 you're sending all the links to one worker. A single call to submit schedules a single task, so only one of the 50 threads does any work. Instead of
exec.submit(img_loader, url, links, path)
you'd need to submit one task per link:
for link in links:
    exec.submit(img_loader, url, [link], path)
I didn't try it out myself; that's just from reading the documentation of ThreadPoolExecutor.
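To see the difference without hitting a real website, here is a minimal sketch that swaps the download for a short time.sleep. The names `fake_loader`, `run_single_submit` and `run_submit_per_item` are invented for this example, and the timings will vary from machine to machine, so treat it as an illustration rather than a benchmark.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def fake_loader(items, seen_threads):
    """Stand-in for img_loader: sleeps instead of downloading each item."""
    for _ in items:
        time.sleep(0.02)
        # Remember which thread handled the item
        seen_threads.add(threading.get_ident())

def run_single_submit(items, workers=10):
    """One submit call: the whole list becomes a single task."""
    seen = set()
    start = time.perf_counter()
    with ThreadPoolExecutor(workers) as pool:
        pool.submit(fake_loader, items, seen).result()
    return time.perf_counter() - start, len(seen)

def run_submit_per_item(items, workers=10):
    """One submit call per item: the pool can spread the tasks over its workers."""
    seen = set()
    start = time.perf_counter()
    with ThreadPoolExecutor(workers) as pool:
        futures = [pool.submit(fake_loader, [item], seen) for item in items]
        for future in futures:
            future.result()  # re-raises any exception from the task
    return time.perf_counter() - start, len(seen)

items = list(range(20))
single_time, single_threads = run_single_submit(items)
per_item_time, per_item_threads = run_submit_per_item(items)

print('single submit:   {:.2f}s on {} thread(s)'.format(single_time, single_threads))
print('submit per item: {:.2f}s on {} thread(s)'.format(per_item_time, per_item_threads))
```

Calling result() on each future also surfaces exceptions that would otherwise be silently stored in the future object. If you don't need the futures themselves, the pool's map method is a shorter way to run one task per item.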