
Selenium vs Jsoup performance

I am doing some web scraping with Selenium (so whether to use Selenium at all is not the question). When I have to identify an element (e.g. to get a src attribute), should I use Selenium's built-in selector engine, or should I use Jsoup (which is a lot easier)? So the question is: does using Jsoup carry a significant performance cost? Should I use Selenium as often as possible? Thanks

fabio_vac asked Nov 29 '15 11:11



1 Answer

If you already have the DOM parsed into JSoup, I would recommend using JSoup. It is much faster than Selenium, since it does not have to deal with a "living" DOM. Selenium has to check that its element handles are still valid before every operation on them.
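A minimal sketch of the JSoup side, assuming the HTML is already available as a string. The class name `ImageSources`, the helper `extractImageSources`, and the example URLs are made up for illustration; `Jsoup.parse`, `select`, and `absUrl` are regular jsoup calls.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ImageSources {

    // Hypothetical helper: returns the absolute src of every <img> that has one.
    // jsoup uses baseUri to resolve relative URLs such as "/a.png".
    public static List<String> extractImageSources(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> sources = new ArrayList<>();
        for (Element img : doc.select("img[src]")) {
            sources.add(img.absUrl("src"));
        }
        return sources;
    }

    public static void main(String[] args) {
        // Placeholder markup; the third image has no src and is skipped by the selector.
        String html = "<div><img src=\"/a.png\">"
                + "<img src=\"https://cdn.example.com/b.jpg\">"
                + "<img alt=\"no src\"></div>";
        System.out.println(extractImageSources(html, "https://example.com/page"));
    }
}
```

Everything here runs on an in-memory document, so no call goes back to a browser, which is where the speed difference mentioned above comes from.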

If you can, avoid Selenium altogether, since its overhead becomes really noticeable in serious scraping. Selenium shines, however, when the content is generated dynamically by JavaScript in the client. JSoup cannot handle that at all, since it does not execute JavaScript.

Addendum to answer a comment

Short answer: It depends!

Longer: If the website you are scraping is generated by JavaScript and does not change after generation, it is perfectly fine to access it with Selenium, especially if the DOM is complex and would take long to read into JSoup (although JSoup is fairly fast). However, JSoup builds its own copy of the DOM in memory, so a huge DOM ends up in memory twice: once in the memory-hungry form Selenium holds, and once more in JSoup. This may or may not be an issue in your case, but it is worth keeping in mind.

From my personal experience, I would kill the Selenium process as soon as possible after getting the final HTML and then parse that HTML in JSoup. As you say, JSoup scraping is way easier than the corresponding Selenium selector constructs. This works especially well if you are sure that any changes to the DOM after its initial creation are irrelevant to your scraping.
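A rough sketch of that hand-off, assuming Selenium's Java bindings and jsoup are on the classpath. The class name `SnapshotScraper`, the helper `snapshotAndQuit`, and the URL are placeholders for illustration; `getPageSource`, `getCurrentUrl`, and `quit` are standard `WebDriver` methods. How closely `getPageSource` reflects the JavaScript-modified DOM can depend on the driver, so treat this as a starting point rather than a guaranteed recipe.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class SnapshotScraper {

    // Hypothetical helper: takes one HTML snapshot from the browser,
    // then shuts the browser down even if the snapshot fails.
    public static Document snapshotAndQuit(WebDriver driver) {
        try {
            String html = driver.getPageSource();
            String baseUri = driver.getCurrentUrl();
            return Jsoup.parse(html, baseUri);
        } finally {
            driver.quit();
        }
    }

    public static void main(String[] args) {
        // Assumes Chrome and a matching driver are available on this machine.
        WebDriver driver = new ChromeDriver();
        driver.get("https://example.com/"); // placeholder URL
        // Real pages may need an explicit wait here until the JavaScript has finished rendering.
        Document doc = snapshotAndQuit(driver);
        // From here on, all selection happens in jsoup on the in-memory copy.
        System.out.println(doc.select("img[src]").eachAttr("abs:src"));
    }
}
```

Once `snapshotAndQuit` returns, the browser is gone, so only the JSoup copy of the DOM stays in memory.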

luksch answered Sep 25 '22 06:09