
Scrapy Shell and Scrapy Splash


We've been using the scrapy-splash middleware to pass the scraped HTML source through the Splash JavaScript engine running inside a Docker container.

To use Splash in a spider, we configure several required project settings and yield a Request with specific meta arguments:

from scrapy import Request
from scrapy_splash import SlotPolicy  # SlotPolicy lives in scrapy_splash (formerly scrapyjs)

yield Request(url, self.parse_result, meta={
    'splash': {
        'args': {
            # set rendering arguments here
            'html': 1,
            'png': 1,
            # 'url' is prefilled from the request url
        },
        # optional parameters
        'endpoint': 'render.json',  # optional; default is render.json
        'splash_url': '<url>',      # overrides SPLASH_URL
        'slot_policy': SlotPolicy.PER_DOMAIN,
    }
})
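For reference, the required project settings mentioned above look roughly like this in settings.py. This is a sketch based on the scrapy-splash README, assuming Splash is listening locally on port 8050:

SPLASH_URL = 'http://localhost:8050'  # assumption: Splash container runs locally

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'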

This works as documented. But how can we use scrapy-splash inside the Scrapy shell?

asked Feb 11 '16 by alecxe

2 Answers

Just wrap the URL you want to open in the shell in the Splash HTTP API.

So you would want something like:

scrapy shell 'http://localhost:8050/render.html?url=http://domain.com/page-with-javascript.html&timeout=10&wait=0.5' 

where localhost:8050 is where your Splash service is running,
url is the URL you want to crawl (don't forget to URL-quote it! See the sketch after this list),
render.html is one of the possible HTTP API endpoints, returning the rendered HTML page in this case,
timeout is the timeout in seconds,
and wait is the time in seconds to wait for JavaScript to execute before reading/saving the HTML.
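As a minimal sketch of that quoting step (the page URL and local Splash address are the hypothetical ones from the example above), the query string can be built with the standard library so any special characters in the target URL are escaped correctly:

from urllib.parse import urlencode

splash = 'http://localhost:8050/render.html'
params = {
    'url': 'http://domain.com/page-with-javascript.html',  # hypothetical target page
    'timeout': 10,
    'wait': 0.5,
}
# urlencode percent-escapes the nested url, so any ?, & or # inside it survive
print("scrapy shell '%s?%s'" % (splash, urlencode(params)))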

answered by Granitosaurus

You can run scrapy shell without arguments inside a configured Scrapy project, then create req = scrapy_splash.SplashRequest(url, ...) and call fetch(req).
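A sketch of such a session, assuming the project is already configured for scrapy-splash (see the settings sketched in the question) and reusing the hypothetical page URL from the first answer:

$ scrapy shell
>>> from scrapy_splash import SplashRequest
>>> req = SplashRequest('http://domain.com/page-with-javascript.html',
...                     args={'wait': 0.5})  # same rendering args as in a spider
>>> fetch(req)   # the shell's fetch() accepts a Request object
>>> response.css('title::text').extract_first()  # response is the rendered page

This way the request goes through the configured Splash middleware, exactly as it would during a crawl.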

answered by Mikhail Korobov