I am currently using Selenium to crawl data from some websites. Unlike with urllib, it seems that I do not really need a parser like BeautifulSoup to parse the HTML: I can simply find an element with Selenium and use WebElement.text to get the data I need. However, I have seen some people use Selenium and BeautifulSoup together for web crawling. Is that really necessary? Are there any special features that bs4 offers to improve the crawling process? Thank you.
Selenium itself is quite powerful in terms of locating elements, and it basically has everything you need for extracting data from HTML. The problem is that it is slow: every single Selenium command goes through the JSON wire HTTP protocol, which adds substantial overhead.
To improve the performance of the HTML-parsing part, it is usually much faster to let BeautifulSoup or lxml parse the page source retrieved from driver.page_source.
In other words, a common workflow for a dynamic web page is something like:

- open the page in a browser controlled by Selenium and perform any necessary browser actions (clicking, scrolling, waiting for content to load)
- grab driver.page_source and close the browser
- pass the page source to an HTML parser such as BeautifulSoup or lxml for further parsing
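A minimal sketch of that workflow might look like the following. The Selenium part is commented out because it needs a real browser and a real URL; the HTML string here is a hypothetical stand-in for what driver.page_source would return, just to show the hand-off to BeautifulSoup:

```python
from bs4 import BeautifulSoup

# Sketch of the Selenium side (assumes a Chrome driver is installed):
# from selenium import webdriver
# driver = webdriver.Chrome()
# driver.get("https://example.com")       # hypothetical URL
# html = driver.page_source               # grab the rendered HTML once
# driver.quit()                           # close the browser right away

# Stand-in for driver.page_source so the parsing step is runnable here:
html = """<html><body>
<div class="item"><a href="/a">First</a></div>
<div class="item"><a href="/b">Second</a></div>
</body></html>"""

# All further extraction happens in-process, with no round trips
# to the browser for each element.
soup = BeautifulSoup(html, "html.parser")
titles = [a.text for a in soup.select("div.item a")]
print(titles)  # ['First', 'Second']
```

The key point is that the browser is queried exactly once (for page_source); every subsequent lookup is a local parse, instead of one HTTP round trip per find_element call.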