
scrapy-splash returns its own headers and not the original headers from the site

I use scrapy-splash to build my spider. I need to maintain the session, so I use scrapy.downloadermiddlewares.cookies.CookiesMiddleware, which handles the Set-Cookie header. I know it does because I set COOKIES_DEBUG=True, which makes CookiesMiddleware print the Set-Cookie headers it processes.

The problem: when I add Splash to the picture, the Set-Cookie printouts disappear, and the response headers I get are {'Date': ['Sun, 25 Sep 2016 12:09:55 GMT'], 'Content-Type': ['text/html; charset=utf-8'], 'Server': ['TwistedWeb/16.1.1']}. These come from the Splash rendering engine itself, which uses TwistedWeb, not from the site.

Is there any directive to tell Splash to also give me the original response headers?

Roman Smelyansky asked Sep 25 '16 12:09


1 Answer

To get the original response headers you can write a Splash Lua script; see the examples in the scrapy-splash README:

Use a Lua script to get an HTML response with cookies, headers, body and method set to correct values; the lua_source argument value is cached on the Splash server and is not sent with each request (this requires Splash 2.1+):

import scrapy
from scrapy_splash import SplashRequest

script = """
function main(splash)
  splash:init_cookies(splash.args.cookies)
  assert(splash:go{
    splash.args.url,
    headers=splash.args.headers,
    http_method=splash.args.http_method,
    body=splash.args.body,
    })
  assert(splash:wait(0.5))

  local entries = splash:history()
  local last_response = entries[#entries].response
  return {
    url = splash:url(),
    headers = last_response.headers,
    http_status = last_response.status,
    cookies = splash:get_cookies(),
    html = splash:html(),
  }
end
"""

class MySpider(scrapy.Spider):
    name = 'myspider'                    # placeholder spider name
    start_urls = ['http://example.com']  # placeholder URL

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse_result,
                endpoint='execute',
                cache_args=['lua_source'],
                args={'lua_source': script},
                headers={'X-My-Header': 'value'},
            )

    def parse_result(self, response):
        # here response.body contains result HTML;
        # response.headers are filled with headers from last
        # web page loaded to Splash;
        # cookies from all responses and from JavaScript are collected
        # and put into Set-Cookie response header, so that Scrapy
        # can remember them.
        self.logger.info('Original headers: %s', response.headers)
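
As far as I know, the last_response.headers value returned by the script is in HAR format, i.e. a list of name/value pairs rather than a plain dict, and scrapy-splash converts it for you before parse_result is called. If you ever need to inspect the raw value yourself, here is a minimal sketch of that kind of conversion. The helper name har_headers_to_dict and the sample data are made up for illustration; this is not scrapy-splash's own code.

```python
def har_headers_to_dict(har_headers):
    """Fold a HAR-style list of {'name': ..., 'value': ...} entries
    into a dict mapping each header name to a list of its values."""
    result = {}
    for header in har_headers:
        # Repeated headers (e.g. several Set-Cookie lines) are kept in order.
        result.setdefault(header['name'], []).append(header['value'])
    return result


# Hypothetical sample shaped like the headers of a HAR response entry.
sample = [
    {'name': 'Content-Type', 'value': 'text/html; charset=utf-8'},
    {'name': 'Set-Cookie', 'value': 'sessionid=abc; Path=/'},
    {'name': 'Set-Cookie', 'value': 'lang=en; Path=/'},
]
headers = har_headers_to_dict(sample)
```

Keeping a list per header name matters here, because a site can send several Set-Cookie headers in one response.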

scrapy-splash also provides built-in helpers for cookie handling; they are enabled in this example as soon as scrapy-splash is configured as described in the README.
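
For reference, the configuration in question looks roughly like the settings.py fragment below. It follows the scrapy-splash README as I remember it, so check the README for your installed version; the SPLASH_URL value assumes a Splash instance running locally on the default port.

```python
# settings.py (sketch; verify against the scrapy-splash README)

# Assumption: Splash is running locally on its default port.
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    # SplashCookiesMiddleware is the built-in cookie helper mentioned above.
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    # Needed for cache_args=['lua_source'] to save bandwidth.
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```

With these in place, the cookies returned by the Lua script are merged into the Scrapy cookiejar, so the session is kept across requests.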

Mikhail Korobov answered Oct 19 '22 16:10