So far, I have been using plain Scrapy and writing custom classes to deal with websites that use AJAX.
But if I were to use scrapy-splash, which, from what I understand, scrapes the rendered HTML after JavaScript has run, would the speed of my crawler be affected significantly?
How does the time it takes to scrape a vanilla HTML page with Scrapy compare with scraping JavaScript-rendered HTML with scrapy-splash?
And lastly, how do scrapy-splash and Selenium compare?
Splash is our in-house solution for JavaScript rendering, implemented in Python using Twisted and Qt. Splash is a lightweight web browser that is capable of processing multiple pages in parallel, executing custom JavaScript in the page context, and much more.
Speed. Scrapy is incredibly fast. Its ability to send asynchronous requests makes it hands-down faster than BeautifulSoup. This means that you'll be able to scrape and extract data from many pages at once.
Selenium is an excellent automation tool, and Scrapy is by far the most robust web scraping framework. For web scraping, Scrapy is the better choice in terms of speed and efficiency. For JavaScript-based websites where AJAX/PJAX requests are needed, Selenium can work better.
If you are dealing with a complex scraping operation that demands high speed, prefer Scrapy. If you're new to programming and want to work on web scraping projects, Beautiful Soup is a good choice because it is easy to learn and lets you get results quickly.
It depends on the amount of JavaScript present on the page.
Keep in mind that Splash takes some time to render all the JavaScript, and the Python application proceeds without waiting for the rendering to complete. So sometimes Splash does not return the fully rendered page.
You can give it more time with the wait argument.
Here,
import scrapy

# Inside a spider method: route a plain Request through Splash via meta
yield scrapy.Request(url, callback=self.parse, meta={'splash': {'args': {'wait': 25}, 'endpoint': 'render.html'}})
or
from scrapy_splash import SplashRequest

# Inside a spider method: the same idea using the SplashRequest helper
yield SplashRequest(url, self.parse, endpoint='render.html',
                    args={'wait': 5, 'html': 1})
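To see what the wait argument amounts to outside of Scrapy, here is a minimal sketch that builds the URL for Splash's render.html HTTP endpoint using only the standard library. The helper name splash_render_url is hypothetical, and the base address http://localhost:8050 is an assumption (a locally running Splash instance on its usual port); adjust both to your setup.

```python
from urllib.parse import urlencode


def splash_render_url(page_url, wait=0.5, splash_base="http://localhost:8050"):
    """Build a render.html URL asking Splash to render page_url.

    splash_base is an assumed address for a locally running Splash instance.
    wait is the number of seconds Splash pauses after the page loads so that
    JavaScript has a chance to finish before the HTML is returned.
    """
    # urlencode percent-escapes the target URL so it is safe as a query value
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_base}/render.html?{query}"


if __name__ == "__main__":
    print(splash_render_url("https://example.com/?a=1", wait=2))
```

Fetching the resulting URL with any HTTP client should return the rendered HTML, provided a Splash instance is actually listening at that address.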
Selenium is only used to automate web browser interaction, whereas Scrapy is used to download HTML, process the data, and save it (it is a whole web crawling framework).
For scraping I would recommend Scrapy and, if the problem is JavaScript, pairing it with Splash.
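As for how much slower rendering makes a crawl, a back-of-envelope estimate can help. The sketch below is not a benchmark: the function name, the per-page times, and the concurrency figure are all hypothetical, and it assumes pages are processed in equal-sized concurrent batches, which real crawlers only approximate.

```python
import math


def estimated_crawl_seconds(pages, seconds_per_page, concurrency):
    """Rough estimate of crawl time, assuming pages run in full concurrent batches."""
    batches = math.ceil(pages / concurrency)
    return batches * seconds_per_page


if __name__ == "__main__":
    pages = 1000
    concurrency = 16       # hypothetical number of concurrent requests
    fetch_time = 0.5       # hypothetical seconds to download one plain HTML page
    splash_wait = 5        # the wait value passed to Splash, as in the example above

    plain = estimated_crawl_seconds(pages, fetch_time, concurrency)
    rendered = estimated_crawl_seconds(pages, fetch_time + splash_wait, concurrency)
    print(f"plain Scrapy: ~{plain} s, with Splash wait: ~{rendered} s")
```

The point is only that the wait value is added to every page, so it tends to dominate the total unless concurrency is high or the wait is kept as small as the page allows.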