
Saving cookies between scrapy scrapes

Tags: scrapy

I'm collecting data from a site on a daily basis. Each day I run scrapy, and the first request always gets redirected to the site's homepage, apparently because scrapy doesn't have any cookies set yet. After that first request, scrapy receives the cookie and works fine from then on.

This makes it very difficult to use tools like "scrapy view" with any particular URL, because the site always redirects to the home page, and that's what scrapy opens in my browser.

Can scrapy save the cookie so that I can tell it to use that cookie on all scrapes? Can I also tell it to use the cookie with scrapy view and similar commands?

robodisco asked Nov 11 '22 03:11


1 Answer

There is no built-in mechanism to persist cookies between scrapy runs, but you can build it yourself (the code below only demonstrates the idea and is not tested):

Step 1: Writing the cookies to a file.

Read the cookies from the 'Set-Cookie' response headers in your parse function, then serialize them to a file.

There are several ways to do this, explained here: Access session cookie in scrapy spiders

I prefer the direct approach:

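# in your parse method ...
# Set-Cookie header values arrive as bytes, so decode them first
raw_cookies = [h.decode("utf-8") for h in response.headers.getlist("Set-Cookie")]
# keep only the leading name=value pair; attributes such as Path or HttpOnly are dropped
pairs = (raw.split(";", 1)[0].strip() for raw in raw_cookies)
cookies = dict(pair.split("=", 1) for pair in pairs if "=" in pair)
# serialize cookies
# ...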

Ideally this should be done with the last response your scraper receives. Since you usually cannot tell in advance which response is the last one, serialize the cookies from every response into the same file, overwriting the cookies saved while processing earlier responses.
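The serialization step left open above could look roughly like the following. This is a minimal sketch, not tested against a real spider: the helper names parse_set_cookie and save_cookies, the cookies.json file name, and the choice of JSON are all assumptions made for illustration. It also merges new cookies into the stored ones rather than replacing the whole file, so a response that sets no cookies does not wipe out earlier ones.

```python
import json
from pathlib import Path

COOKIE_FILE = Path("cookies.json")  # hypothetical location, pick your own


def parse_set_cookie(header_values):
    """Turn raw Set-Cookie header values (bytes or str) into a {name: value} dict."""
    cookies = {}
    for raw in header_values:
        text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        # Only the first "name=value" segment is the cookie itself;
        # everything after it (Path, Expires, HttpOnly, ...) is an attribute.
        pair = text.split(";", 1)[0].strip()
        if "=" in pair:
            name, value = pair.split("=", 1)
            cookies[name.strip()] = value.strip()
    return cookies


def save_cookies(new_cookies, path=COOKIE_FILE):
    """Merge new cookies into the file so the latest value of each cookie wins."""
    path = Path(path)
    stored = json.loads(path.read_text()) if path.exists() else {}
    stored.update(new_cookies)
    path.write_text(json.dumps(stored))
    return stored
```

Inside a parse method this would be called as save_cookies(parse_set_cookie(response.headers.getlist("Set-Cookie"))).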

Step 2: Reading and using cookies from file

To use the cookies after loading them from the file, pass them to the first Request you make via the 'cookies' parameter:

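def start_requests(self):
    # deserialize_cookies is a placeholder for however you load the saved file
    old_cookies = deserialize_cookies(xyz)
    # start_requests must return an iterable of Requests, so yield instead of return
    yield Request(url, cookies=old_cookies, ...)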
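The loading side could be sketched as below, assuming the cookies were stored as a JSON object in a file named cookies.json (both the file name and the load_cookies helper are hypothetical). On the very first run, or if the file is unreadable, it falls back to an empty dict, so the spider simply starts without cookies as before.

```python
import json
from pathlib import Path


def load_cookies(path="cookies.json"):
    """Return the saved cookies as a dict, or {} when nothing usable is stored."""
    try:
        data = json.loads(Path(path).read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        # No file yet (first run) or corrupted content: start without cookies.
        return {}
    # Guard against a file that holds valid JSON but not an object.
    return data if isinstance(data, dict) else {}
```

In start_requests, the result would be passed straight through, for example yield Request(url, cookies=load_cookies()).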
Done Data Solutions answered Jan 04 '23 03:01