I'm using a Scrapy spider that authenticates with a login form upon launching. It then scrapes with this authenticated session.
During development I usually run the spider many times to test it out. Authenticating at the beginning of each run spams the login form of the website. The website will often force a password reset in response and I suspect it will ban the account if this continues.
Because the cookies last a number of hours, there's no good reason to log in this often during development. To get around the password reset problem, what would be the best way to re-use an authenticated session/cookies between runs while developing? Ideally the spider would only attempt to authenticate if the persisted session has expired.
Edit:
My structure is like:
def start_requests(self):
    yield scrapy.Request(self.base, callback=self.log_in)

def log_in(self, response):
    # response.headers includes
    # 'Set-Cookie': 'JSESSIONID=xx; Path=/cas/; Secure; HttpOnly'
    yield scrapy.FormRequest.from_response(
        response,
        formdata={'username': 'xxx',
                  'password': ''},
        callback=self.logged_in)

def logged_in(self, response):
    # request.headers (and the headers of all subsequent requests) include
    # 'Cookie': 'JSESSIONID=xxx'
    # response.headers has no mention of cookies
    # request.cookies is empty
When I run the same page request in Chrome, under the 'Cookies' tab there are ~20 fields listed.
The documentation seems thin here. I've tried setting a 'Cookie': 'JSESSIONID=xxx' field on the headers dict of all outgoing requests, using the value returned by a successful login, but this just bounces back to the login screen.
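A side note that helped while debugging: Scrapy's built-in cookie middleware keeps its cookies in an internal jar rather than exposing them on request.cookies, and it has a COOKIES_DEBUG setting that logs the cookies it sends and receives, which makes it easy to confirm whether the session cookie is actually being tracked:

# settings.py (or in the spider's custom_settings)
# When enabled, the cookies middleware logs the cookies sent with each
# request and the cookies received with each response.
COOKIES_DEBUG = True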
It turns out that, for an ad-hoc development solution, this is easier to do than I thought. Grab the cookie string with

cookieString = request.headers['Cookie']

save it somewhere, then on subsequent runs load it and attach it to outgoing requests with:

request.headers.appendlist('Cookie', cookieString)
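A minimal sketch of one way to wire that up between runs, assuming the session stays valid for a few hours; the cache file name, the TTL, the helper functions, and the parse_landing callback are placeholders I made up, not anything from Scrapy:

import json
import os
import time

import scrapy

COOKIE_CACHE = 'dev_cookies.json'  # hypothetical cache file, dev use only
COOKIE_TTL = 6 * 3600              # rough guess at how long the session lasts

def load_cached_cookie():
    """Return the saved Cookie header value if it is still fresh, else None."""
    if not os.path.exists(COOKIE_CACHE):
        return None
    with open(COOKIE_CACHE) as f:
        data = json.load(f)
    if time.time() - data['saved_at'] > COOKIE_TTL:
        return None
    return data['cookie']

def save_cookie(cookie_string):
    """Persist the raw Cookie header value for the next run."""
    with open(COOKIE_CACHE, 'w') as f:
        json.dump({'cookie': cookie_string, 'saved_at': time.time()}, f)

class DevSpider(scrapy.Spider):
    name = 'dev'
    base = 'https://example.com/'  # placeholder for the real start URL

    def start_requests(self):
        cookie = load_cached_cookie()
        if cookie:
            # A fresh cookie is on disk: skip the login form entirely.
            request = scrapy.Request(self.base, callback=self.parse_landing)
            request.headers.appendlist('Cookie', cookie)
            yield request
        else:
            # No usable cookie: authenticate as before.
            yield scrapy.Request(self.base, callback=self.log_in)

    def log_in(self, response):
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'xxx', 'password': 'xxx'},
            callback=self.logged_in)

    def logged_in(self, response):
        # The Cookie header Scrapy attached to the login request; it is
        # bytes, so decode it before writing it to disk.
        save_cookie(response.request.headers['Cookie'].decode('utf-8'))
        yield scrapy.Request(self.base, callback=self.parse_landing)

    def parse_landing(self, response):
        ...  # normal scraping continues from here

One caveat: depending on the Scrapy version, the built-in cookies middleware may or may not carry a hand-set Cookie header across redirects; if it gets in the way, setting COOKIES_ENABLED = False for these development runs sidesteps it.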