So far I have been using plain Scrapy and writing custom classes to deal with websites that use AJAX. If I switch to scrapy-splash, which as I understand it scrapes the HTML after JavaScript has rendered, will the speed of my crawler be affected significantly? How does the time to scrape a plain HTML page with Scrapy compare with scraping JavaScript-rendered HTML with scrapy-splash? And lastly, how do scrapy-splash and Selenium compare?


Does using scrapy-splash significantly affect scraping speed? [closed]

1 Answer

It depends on how much JavaScript the page runs.

Splash needs time to render the JavaScript, and by default it returns the HTML shortly after the page loads, without waiting for scripts that are still running. So the response from Splash can sometimes be missing content too.

You can set an explicit wait, since rendering generally needs some time.
It is good practice to always include some wait.
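To get a feel for what a wait costs in crawl speed, here is a rough back-of-the-envelope sketch. It assumes a simple model in which a fixed number of requests are in flight at once and each one takes its fetch time plus the configured wait. The function name and the numbers are hypothetical and purely illustrative, not measurements.

```python
def estimated_pages_per_second(fetch_seconds, wait_seconds=0.0, concurrency=1):
    """Rough throughput under a simplified model (an assumption, not a benchmark):
    `concurrency` requests are in flight at once, and each one takes
    fetch_seconds (download and render) plus wait_seconds (extra Splash wait)."""
    per_page = fetch_seconds + wait_seconds
    if per_page <= 0:
        raise ValueError("per-page time must be positive")
    return concurrency / per_page


# Illustrative inputs only: 0.5 s per fetch, 16 concurrent requests.
plain = estimated_pages_per_second(fetch_seconds=0.5, concurrency=16)
rendered = estimated_pages_per_second(fetch_seconds=0.5, wait_seconds=5, concurrency=16)

print(f"plain HTML:      {plain:.1f} pages/s")
print(f"with a 5 s wait: {rendered:.1f} pages/s")
```

In this model the wait dominates as soon as it is much longer than the fetch itself, so it is worth keeping it as small as the page allows. Real throughput also depends on how many renders the Splash instance can run in parallel.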

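The wait can be passed in either of two ways. With a plain scrapy.Request and the splash meta key:

import scrapy

# Inside a spider method such as start_requests(); wait is in seconds.
yield scrapy.Request(url, callback=self.parse,
        meta={'splash': {'args': {'wait': 25}, 'endpoint': 'render.html'}})

or with SplashRequest:

from scrapy_splash import SplashRequest

# Inside a spider method such as start_requests(); wait is in seconds.
yield SplashRequest(url, self.parse, endpoint='render.html',
        args={'wait': 5, 'html': 1})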
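The same wait argument can also be tried directly against Splash's HTTP API, which helps when working out how long a page really needs before putting the value into a spider. A minimal sketch, assuming a Splash instance listening on localhost at port 8050; `splash_render_url` is a hypothetical helper, not part of scrapy-splash.

```python
from urllib.parse import urlencode


def splash_render_url(page_url, wait=0.5, timeout=30,
                      splash_base="http://localhost:8050"):
    """Build a GET URL for Splash's render.html endpoint.

    splash_base is an assumption: adjust it to wherever Splash is running.
    wait and timeout are in seconds.
    """
    query = urlencode({"url": page_url, "wait": wait, "timeout": timeout})
    return f"{splash_base}/render.html?{query}"


# Opening this URL in a browser or with curl returns the rendered HTML.
print(splash_render_url("https://example.com/page", wait=5))
```

Trying a few wait values this way shows the smallest one that still returns the content you need.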

Scrapy vs. Selenium

Selenium only automates web browser interaction, whereas Scrapy is a complete web crawling framework: it downloads HTML, processes the data, and saves it.

For scraping I would recommend Scrapy. If the problem is JavaScript, there are two options:

Scrapy already has its own official project for JavaScript rendering, called scrapy-splash.
You can also create a Selenium webdriver instance inside the Scrapy spider, do the work, extract the data, and close the driver once all the work is done.
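For the scrapy-splash option, the spider only talks to Splash once the project settings enable its middlewares. The fragment below is a sketch of a `settings.py` following the scrapy-splash README as I recall it; the `SPLASH_URL` value assumes a local Splash instance, and the values should be checked against the README of the installed version.

```python
# settings.py (sketch): enable scrapy-splash in a Scrapy project.

# Assumption: Splash runs locally on port 8050.
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Make duplicate filtering and HTTP caching aware of Splash arguments.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```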

answered Oct 18 '22 10:10

Nandesh


Tags: python, selenium, web-scraping, scrapy, scrapy-splash

hsy
