Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python selenium multiprocessing

I've written a script in python in combination with selenium to scrape the links of different posts from its landing page and finally get the title of each post by tracking the url leading to its inner page. Although the content I parsed here are static ones, I used selenium to see how it works in multiprocessing.

However, my intention is to do the scraping using multiprocessing. So far I know that selenium doesn't support multiprocessing but it seems I was wrong.

My question: how can I reduce the execution time using selenium when it is made to run using multiprocessing?

This is my try (it's a working one):

import requests
from urllib.parse import urljoin
from multiprocessing.pool import ThreadPool
from bs4 import BeautifulSoup
from selenium import webdriver

def get_links(link):
  res = requests.get(link)
  soup = BeautifulSoup(res.text,"lxml")
  titles = [urljoin(url,items.get("href")) for items in soup.select(".summary .question-hyperlink")]
  return titles

def get_title(url):
  chromeOptions = webdriver.ChromeOptions()
  chromeOptions.add_argument("--headless")
  driver = webdriver.Chrome(chrome_options=chromeOptions)
  driver.get(url)
  sauce = BeautifulSoup(driver.page_source,"lxml")
  item = sauce.select_one("h1 a").text
  print(item)

if __name__ == '__main__':
  url = "https://stackoverflow.com/questions/tagged/web-scraping"
  ThreadPool(5).map(get_title,get_links(url))
like image 296
robots.txt Avatar asked Nov 26 '18 06:11

robots.txt


2 Answers

My question: how can I reduce the execution time?

Selenium seems the wrong tool for web scraping - though I appreciate YMMV, in particular if you need to simulate user interaction with the web site or there is some JavaScript limitation/requirement.

For scraping tasks without much interaction, I have had good results using the opensource Scrapy Python package for large-scale scrapying tasks. It does multiprocessing out of the box, it is easy to write new scripts and store the data in files or a database -- and it is really fast.

Your script would look something like this when implemented as a fully parallel Scrapy spider (note I did not test this, see documentation on selectors).

import scrapy
class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']

    def parse(self, response):
        for title in response.css('.summary .question-hyperlink'):
            yield title.get('href')

To run put this into blogspider.py and run

$ scrapy runspider blogspider.py

See the Scrapy website for a complete tutorial.

Note that Scrapy also supports JavaScript through scrapy-splash, thanks to the pointer by @SIM. I didn't have any exposure with that so far so can't speak to this other than it looks well integrated with how Scrapy works.

like image 152
miraculixx Avatar answered Nov 06 '22 06:11

miraculixx


The one potential problem I see with the clever one-driver-per-thread answer is that it omits any mechanism for "quitting" the drivers and thus leaving the possibility of processes hanging around. I would make the following changes:

  1. Use instead class Driver that will crate the driver instance and store it on the thread-local storage but also have a destructor that will quit the driver when the thread-local storage is deleted:
class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(options=options)

    def __del__(self):
        self.driver.quit() # clean up driver when we are cleaned up
        #print('The driver has been "quitted".')
  1. create_driver now becomes:
threadLocal = threading.local()

def create_driver():
    the_driver = getattr(threadLocal, 'the_driver', None)
    if the_driver is None:
        the_driver = Driver()
        setattr(threadLocal, 'the_driver', the_driver)
    return the_driver.driver
  1. Finally, after you have no further use for the ThreadPool instance but before it is terminated, add the following lines to delete the thread-local storage and force the Driver instances' destructors to be called (hopefully):
del threadLocal
import gc
gc.collect() # a little extra insurance
like image 23
Booboo Avatar answered Nov 06 '22 05:11

Booboo