We've been using the scrapy-splash middleware to pass the scraped HTML source through the Splash JavaScript engine running inside a Docker container.
To use Splash in a spider, we configure several required project settings and yield a Request with specific meta arguments:
    yield Request(url, self.parse_result, meta={
        'splash': {
            'args': {
                # set rendering arguments here
                'html': 1,
                'png': 1,
                # 'url' is prefilled from request url
            },
            # optional parameters
            'endpoint': 'render.json',  # optional; default is render.json
            'splash_url': '<url>',      # overrides SPLASH_URL
            'slot_policy': scrapy_splash.SlotPolicy.PER_DOMAIN,
        }
    })
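For reference, the "required project settings" mentioned above usually look roughly like the sketch below. It follows the scrapy-splash README as I remember it; the `SPLASH_URL` value is an assumption (Splash listening on localhost:8050), so adjust it to wherever your container is exposed.

```python
# settings.py (sketch; SPLASH_URL is an assumed address for a local Splash container)
SPLASH_URL = 'http://localhost:8050'

# Enable the Splash downloader middlewares and keep HttpCompressionMiddleware after them
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Deduplicates repeated Splash arguments to save space
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Splash-aware duplicate filter and HTTP cache storage
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```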
This works as documented. But how can we use scrapy-splash inside the Scrapy shell?
Just wrap the URL you want to open in the shell in the Splash HTTP API.
You would want something like:
scrapy shell 'http://localhost:8050/render.html?url=http://domain.com/page-with-javascript.html&timeout=10&wait=0.5'
where:
- localhost:port is where your Splash service is running
- url is the URL you want to crawl; don't forget to URL-quote it!
- render.html is one of the possible HTTP API endpoints; in this case it returns the rendered HTML page
- timeout is the timeout in seconds
- wait is the time in seconds to wait for JavaScript to execute before reading/saving the HTML.
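Since the target URL has to be URL-quoted, it can be easier to build the Splash URL programmatically. Here is a minimal sketch using only the standard library; `splash_render_url` is a hypothetical helper name, the default base address assumes Splash on localhost:8050, and the `?a=1&b=2` query string is made up to show why quoting matters.

```python
from urllib.parse import urlencode


def splash_render_url(target_url, splash_base="http://localhost:8050",
                      endpoint="render.html", **args):
    """Build a Splash HTTP API URL with the target URL properly quoted.

    Hypothetical helper: extra keyword arguments (e.g. timeout, wait)
    are passed through as query parameters.
    """
    params = {"url": target_url, **args}
    # urlencode percent-encodes the target URL, so its own '?' and '&'
    # are not mistaken for Splash parameters
    return f"{splash_base}/{endpoint}?{urlencode(params)}"


shell_url = splash_render_url(
    "http://domain.com/page-with-javascript.html?a=1&b=2",
    timeout=10,
    wait=0.5,
)
# Print the command to paste into a terminal
print(f"scrapy shell '{shell_url}'")
```

Without the quoting, `b=2` in the target URL would be read as an argument to Splash rather than as part of the page address.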
You can run scrapy shell without arguments inside a configured Scrapy project, then create req = scrapy_splash.SplashRequest(url, ...) and call fetch(req).