
Persist authenticated session between crawls for development in Scrapy

I'm using a Scrapy spider that authenticates with a login form upon launching. It then scrapes with this authenticated session.

During development I usually run the spider many times to test it out. Authenticating at the start of every run spams the website's login form; the site often forces a password reset in response, and I suspect it will ban the account if this continues.

Because the cookies last a number of hours, there's no good reason to log in this often during development. To get around the password reset problem, what would be the best way to re-use an authenticated session/cookies between runs while developing? Ideally the spider would only attempt to authenticate if the persisted session has expired.

Edit:

My spider is structured like this:

def start_requests(self):
    yield scrapy.Request(self.base, callback=self.log_in)

def log_in(self, response):
    # response.headers includes 'Set-Cookie': 'JSESSIONID=xxx; Path=/cas/; Secure; HttpOnly'
    yield scrapy.FormRequest.from_response(response,
                                           formdata={'username': 'xxx',
                                                     'password': ''},
                                           callback=self.logged_in)

def logged_in(self, response):
    # request.headers and all subsequent requests carry a 'Cookie': 'JSESSIONID=xxx' header
    # response.headers has no mention of cookies
    # request.cookies is empty
    ...

When I run the same page request in Chrome, under the 'Cookies' tab there are ~20 fields listed.
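To compare that with what Scrapy itself is sending and receiving, the built-in cookie middleware can log both sides via the COOKIES_DEBUG setting:

# settings.py
COOKIES_DEBUG = True  # log the Cookie and Set-Cookie headers of every request/response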

The documentation seems thin here. I've tried setting a 'Cookie': 'JSESSIONID=xxx' field on the headers dict of every outgoing request, using the value returned by a successful login, but this just bounces back to the login screen.
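For reference, scrapy.Request also accepts a cookies dict, which hands values to the cookie middleware's cookiejar instead of writing the raw header; a minimal sketch, with the cookie value assumed:

yield scrapy.Request(self.base,
                     # seeds the middleware's cookiejar rather than the raw header
                     cookies={'JSESSIONID': 'xxx'},
                     callback=self.logged_in)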

asked Jun 29 '16 by Regan

1 Answer

Turns out that for an ad-hoc development solution this is easier than I thought. Grab the cookie string with cookieString = request.headers['Cookie'], save it somewhere, then on subsequent runs load it and attach it to outgoing requests:

request.headers.appendlist('Cookie', cookieString)
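Putting that together, here is a minimal sketch of how the whole flow could look. The file name dev_cookies.txt, the parse_page callback, and the login-redirect expiry check are all assumptions for illustration, not part of the original answer:

import os

import scrapy


COOKIE_FILE = 'dev_cookies.txt'  # hypothetical local path, development only


class SessionSpider(scrapy.Spider):
    name = 'session_spider'
    base = 'https://example.com/'  # placeholder URL

    def start_requests(self):
        if os.path.exists(COOKIE_FILE):
            # Re-use the saved session instead of hitting the login form again.
            with open(COOKIE_FILE, 'rb') as f:
                cookie_string = f.read().strip()
            request = scrapy.Request(self.base, callback=self.parse_page)
            request.headers.appendlist('Cookie', cookie_string)
            yield request
        else:
            yield scrapy.Request(self.base, callback=self.log_in)

    def log_in(self, response):
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'xxx', 'password': 'xxx'},  # redacted credentials
            callback=self.logged_in)

    def logged_in(self, response):
        # The request that produced this response carries the session cookie;
        # persist it so the next run can skip authentication.
        with open(COOKIE_FILE, 'wb') as f:
            f.write(response.request.headers['Cookie'])
        yield scrapy.Request(self.base, callback=self.parse_page,
                             dont_filter=True)

    def parse_page(self, response):
        # Assumed expiry check: if the site bounced us back to its login form,
        # drop the stale cookie file and authenticate again.
        if 'login' in response.url:
            os.remove(COOKIE_FILE)
            yield scrapy.Request(self.base, callback=self.log_in,
                                 dont_filter=True)
            return
        ...  # normal scraping goes here

If the cookie middleware ends up overwriting the manually attached header, setting COOKIES_ENABLED = False while developing is one way to make the raw Cookie header authoritative.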
answered Nov 16 '22 by Regan