
Scrapy - how to manage cookies/sessions

I'm a bit confused as to how cookies work with Scrapy, and how you manage those cookies.

This is basically a simplified version of what I'm trying to do.


The way the website works:

When you visit the website you get a session cookie.

When you make a search, the website remembers what you searched for, so when you do something like going to the next page of results, it knows the search it is dealing with.


My script:

My spider has a start url of searchpage_url

The searchpage is requested by parse() and the search form response gets passed to search_generator()

search_generator() then yields lots of search requests using FormRequest and the search form response.

Each of those FormRequests, and its subsequent child requests, needs to have its own session, so each needs its own individual cookiejar and its own session cookie (a sketch of this flow follows below).
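
For concreteness, here's a minimal sketch of that flow (the spider name, form field, and search terms are made up for illustration):

import scrapy

class SearchSpider(scrapy.Spider):
    name = 'search'
    start_urls = ['http://www.example.com/search']  # searchpage_url

    def parse(self, response):
        # Hand the search form response off to the generator.
        return self.search_generator(response)

    def search_generator(self, response):
        # One FormRequest per search; each of these (and its children)
        # should get its own session cookie.
        for term in ['foo', 'bar', 'baz']:
            yield scrapy.FormRequest.from_response(
                response,
                formdata={'q': term},
                callback=self.parse_results,
            )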


I've seen the section of the docs that talks about a meta option that stops cookies from being merged. What does that actually mean? Does it mean the spider that makes the request will have its own cookiejar for the rest of its life?
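
For reference, the option I mean is the dont_merge_cookies meta key (per the CookiesMiddleware docs); as I understand it, it's used like this:

# This request is excluded from the shared cookiejar: no stored cookies
# are sent with it, and cookies it receives are not saved.
yield scrapy.Request(
    'http://www.example.com/search',
    meta={'dont_merge_cookies': True},
    callback=self.parse_results,
)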

If cookies are then on a per-spider level, how does it work when multiple spiders are spawned? Is it possible to make only the first request generator spawn new spiders, and make sure that from then on only that spider deals with future requests?

I assume I have to disable multiple concurrent requests; otherwise one spider would be making multiple searches under the same session cookie, and future requests would only relate to the most recent search made.

I'm confused; any clarification would be gratefully received!


EDIT:

Another option I've just thought of is managing the session cookie completely manually, passing it from one request to the next.

I suppose that would mean disabling cookies, then grabbing the session cookie from the search response and passing it along to each subsequent request (sketched below).

Is this what you should do in this situation?
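
Something like this is what I have in mind (a rough sketch, assuming cookies are disabled with COOKIES_ENABLED = False and the session cookie arrives in a Set-Cookie header; the URLs and callback names are placeholders):

from http.cookies import SimpleCookie

import scrapy

def parse_search(self, response):
    # Pull the session cookie out of the response by hand...
    jar = SimpleCookie()
    for header in response.headers.getlist('Set-Cookie'):
        jar.load(header.decode())
    cookie = '; '.join('%s=%s' % (k, m.value) for k, m in jar.items())
    # ...and attach it manually to the next request in the chain.
    yield scrapy.Request(
        'http://www.example.com/results?page=2',
        headers={'Cookie': cookie},
        callback=self.parse_next_page,
    )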

asked Feb 12 '11 by Acorn


People also ask

What is callback in Scrapy?

In the callback function, you parse the response (web page) and return either Item objects, Request objects, or an iterable of both. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and then their response handled by the specified callback.
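
A minimal illustration of that chain (the spider and selectors are hypothetical):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://www.example.com']

    def parse(self, response):
        # Return an item from this page...
        yield {'title': response.css('title::text').get()}
        # ...and a Request whose response goes to the named callback.
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)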

What is request Meta in Scrapy?

See Request.meta special keys for a list of special meta keys recognized by Scrapy. This dict is shallow copied when the request is cloned using the copy() or replace() methods, and can also be accessed, in your spider, from the response.meta attribute.
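
For example (hypothetical callbacks; the value set in meta reappears on the response):

def parse(self, response):
    # Stash a value in meta; it travels with the request.
    yield scrapy.Request(
        'http://www.example.com/detail',
        meta={'category': 'books'},
        callback=self.parse_detail,
    )

def parse_detail(self, response):
    # The same dict is readable from response.meta in the callback.
    yield {'category': response.meta['category'], 'url': response.url}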


1 Answer

Three years later, I think this is exactly what you were looking for: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#std:reqmeta-cookiejar

Just use something like this in your spider's start_requests method:

for i, url in enumerate(urls):
    yield scrapy.Request(url, meta={'cookiejar': i},
        callback=self.parse_page)

And remember that for subsequent requests, you need to explicitly reattach the cookiejar each time:

def parse_page(self, response):
    # do some processing
    return scrapy.Request("http://www.example.com/otherpage",
        meta={'cookiejar': response.meta['cookiejar']},
        callback=self.parse_other_page)
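
Put together, a complete spider might look like this (URLs and callback bodies are placeholders); since each search carries its own jar, there's no need to disable concurrent requests:

import scrapy

class PerSearchSessionSpider(scrapy.Spider):
    name = 'per_search_session'

    def start_requests(self):
        urls = ['http://www.example.com/search?q=foo',
                'http://www.example.com/search?q=bar']
        for i, url in enumerate(urls):
            yield scrapy.Request(url, meta={'cookiejar': i},
                callback=self.parse_page)

    def parse_page(self, response):
        # Reattach the same jar so this session's cookies follow the chain.
        yield scrapy.Request('http://www.example.com/otherpage',
            meta={'cookiejar': response.meta['cookiejar']},
            callback=self.parse_other_page)

    def parse_other_page(self, response):
        yield {'url': response.url, 'session': response.meta['cookiejar']}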
answered Sep 27 '22 by Noah_S