I'm confused about parallel execution in Python using Selenium. There seem to be a few ways to go about it, but some seem out of date.

1. There's a Python module called python-wd-parallel which seems to have some functionality to do this, but it's from 2013. Is it still useful now? I also found this example.
2. There's concurrent.futures, which seems a lot newer but not so easy to implement. Does anyone have a working example of parallel execution in Selenium?
3. There's also using just threads and executors to get the job done, but I feel this will be slower, because it's not using all the cores and is still running serially.

What is the latest way to do parallel execution using Selenium?


Python parallel execution with selenium

2 Answers

Use joblib's Parallel module to do that; it's a great library for parallel execution.

Let's say we have a list of URLs named urls and we want to take a screenshot of each one in parallel.

First, let's import the necessary libraries:

from selenium import webdriver
from joblib import Parallel, delayed

Now let's define a function that takes a screenshot as base64:

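def take_screenshot(url):
    # PhantomJS support was removed in Selenium 4; this needs Selenium 3 or older
    phantom = webdriver.PhantomJS('/path/to/phantomjs')
    try:
        phantom.get(url)
        return phantom.get_screenshot_as_base64()
    finally:
        # quit() also stops the driver process; close() only closes the window
        phantom.quit()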

Now, to execute that in parallel:

screenshots = Parallel(n_jobs=-1)(delayed(take_screenshot)(url) for url in urls)

When this line finishes executing, screenshots holds the return values from all of the workers that ran, in the same order as urls.

Explanation about Parallel

Parallel(n_jobs=-1) means use all available CPU cores
delayed(function)(input) is joblib's way of capturing a function and its arguments so the call can be run later, in parallel

More information can be found in the joblib docs.

answered Sep 21 '22 16:09

bluesummers

Python Parallel WD seems to be dead, judging by its GitHub repository (last commit 9 years ago). It also implements an obsolete protocol for Selenium. I haven't tested it, but I wouldn't recommend it.

Selenium Performance Boost (concurrent.futures)

Short Answer

Both threads and processes will give you a considerable speedup on your Selenium code.

Short examples are given below. The Selenium work is done by the selenium_title function, which returns the page title. The examples do not handle exceptions raised during each thread/process execution; for that, see Long Answer - Dealing with exceptions.

A pool of thread workers: concurrent.futures.ThreadPoolExecutor.

from selenium import webdriver  
from concurrent import futures

def selenium_title(url):  
  wdriver = webdriver.Chrome() # chrome webdriver
  wdriver.get(url)  
  title = wdriver.title  
  wdriver.quit()
  return title

links = ["https://www.amazon.com", "https://www.google.com"]

with futures.ThreadPoolExecutor() as executor: # default/optimized number of threads
  titles = list(executor.map(selenium_title, links))

A pool of process workers: concurrent.futures.ProcessPoolExecutor. Just replace ThreadPoolExecutor with ProcessPoolExecutor in the code above; both derive from the Executor base class. You must also guard the entry point with if __name__ == '__main__', as below.

if __name__ == '__main__':
  with futures.ProcessPoolExecutor() as executor: # default/optimized number of processes
    titles = list(executor.map(selenium_title, links))

Long Answer

Why do threads work despite the Python GIL?

Python threads are limited by the GIL and are context-switched, but Selenium code still gains from them because of how Selenium is implemented. Selenium works by sending commands as HTTP requests (POST, GET) to the browser driver server. As you might already know, I/O-bound operations such as HTTP requests release the GIL while they wait, and that is where the performance gain comes from.
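The effect can be seen without a browser. In the minimal sketch below, fake_driver_call is a hypothetical stand-in for one Selenium command: it only sleeps, which releases the GIL much like waiting on an HTTP response does. The URLs and the 0.2 second delay are made up for illustration.

```python
import time
from concurrent import futures

def fake_driver_call(url, delay=0.2):
    # Hypothetical stand-in for a Selenium command: sleeping releases the GIL,
    # much like waiting on the driver server's HTTP response does.
    time.sleep(delay)
    return "title of " + url

def run_serial(urls):
    # One call after another: the waits add up.
    start = time.perf_counter()
    titles = [fake_driver_call(url) for url in urls]
    return titles, time.perf_counter() - start

def run_threaded(urls):
    # One thread per URL: the waits overlap.
    start = time.perf_counter()
    with futures.ThreadPoolExecutor(max_workers=len(urls)) as executor:
        titles = list(executor.map(fake_driver_call, urls))
    return titles, time.perf_counter() - start

urls = ["https://example.com/%d" % i for i in range(4)]  # made-up URLs
serial_titles, serial_time = run_serial(urls)
threaded_titles, threaded_time = run_threaded(urls)
print("serial: %.2fs, threaded: %.2fs" % (serial_time, threaded_time))
```

The serial run waits for each call in turn, while the threaded run overlaps the waits, so its total time should be close to that of a single call.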

Dealing with exceptions

We can make small modifications to the example above to deal with exceptions raised in the spawned threads. Instead of using executor.map we use executor.submit, which returns each title wrapped in a Future instance.

To access a returned title we can call future_titles[index].result(), where index runs from 0 to len(links) - 1, or simply use a for loop like below.

with futures.ThreadPoolExecutor() as executor:
  future_titles = [ executor.submit(selenium_title, link) for link in links ]
  for future_title, link in zip(future_titles, links):
    try:
      title = future_title.result() # can use `timeout` to wait max seconds for each thread
    except Exception as exc: # this thread might have raised an exception
      # {0} and {1} are positional fields; the original '{:0}'/'{:1}' were format specs and fail on exc
      print('url {0} generated an exception: {1}'.format(link, exc))

Note that besides iterating over future_titles we also iterate over links, so if an exception occurs in some thread we know which URL (link) was responsible.

The futures.Future class is handy because it gives you control over the result of each thread: whether it completed correctly, raised an exception, and so on. The concurrent.futures documentation has more.
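The timeout argument mentioned in the code comment can be tried out with a minimal sketch like the one below. It assumes a hypothetical slow_title function that sleeps instead of loading a page; the URLs, delays and the 0.3 second limit are made up for illustration.

```python
import time
from concurrent import futures

def slow_title(url, delay):
    # Hypothetical stand-in for selenium_title: sleeps instead of loading a page.
    time.sleep(delay)
    return "title of " + url

# Made-up (url, delay) pairs; the second one is deliberately too slow.
jobs = [("https://example.com/fast", 0.05), ("https://example.com/slow", 1.0)]

titles, timed_out = {}, []
with futures.ThreadPoolExecutor() as executor:
    future_titles = [executor.submit(slow_title, url, delay) for url, delay in jobs]
    for future_title, (url, _) in zip(future_titles, jobs):
        try:
            titles[url] = future_title.result(timeout=0.3)  # wait at most 0.3 s
        except futures.TimeoutError:
            # result() gave up waiting; the worker thread itself keeps running
            timed_out.append(url)

print("finished:", list(titles))
print("timed out:", timed_out)
```

Note that a timeout only stops the wait for the result. The with block still waits for the slow thread to finish on exit, so a real Selenium task would need its own page-load timeout to actually stop early.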

Also worth mentioning: futures.as_completed is a better fit if you don't care about the order in which the threads return items. But since the syntax for handling exceptions with it is a little ugly, I omitted it here.
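For reference, an as_completed version might look like the sketch below. It is not part of the original answer: fake_title is a hypothetical stand-in for selenium_title that fails for one made-up URL, and the dictionary from Future to link is one common way to recover which URL a result belongs to.

```python
from concurrent import futures

def fake_title(url):
    # Hypothetical stand-in for selenium_title: fails for one made-up URL.
    if "broken" in url:
        raise ValueError("could not load " + url)
    return "title of " + url

links = [
    "https://example.com/a",
    "https://example.com/broken",
    "https://example.com/b",
]

titles, errors = {}, {}
with futures.ThreadPoolExecutor() as executor:
    # Map each Future back to its URL, since completion order is arbitrary.
    future_to_link = {executor.submit(fake_title, link): link for link in links}
    for future in futures.as_completed(future_to_link):
        link = future_to_link[future]
        try:
            titles[link] = future.result()
        except Exception as exc:
            errors[link] = exc
            print('url {0} generated an exception: {1}'.format(link, exc))
```

Results are stored in dictionaries keyed by URL because as_completed yields futures as they finish, not in submission order.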

Performance gain and Threads

First, why I have always used threads to speed up my Selenium code:

On I/O-bound tasks, my experience with Selenium shows minimal or no difference between using a pool of processes and a pool of threads. Others have reached similar conclusions about Python threads vs. processes on I/O-bound tasks.
We also know that processes use their own memory space, which means more memory consumption. Processes are also a little slower to spawn than threads.

answered Sep 21 '22 16:09

iambr
