I want Scrapy to crawl pages where the link to the next page looks like this:
<a href="#" onclick="return gotoPage('2');"> Next </a>
Will Scrapy be able to interpret that JavaScript?
With the Live HTTP Headers extension I found out that clicking Next generates a POST with a really huge piece of "garbage" that starts like this:
encoded_session_hidden_map=H4sIAAAAAAAAALWZXWwj1RXHJ9n
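The `H4sI` prefix is what gzip data looks like once it has been base64-encoded, so the value is probably compressed, serialized state rather than noise. Here is a minimal sketch of how such a value could be inspected. The helper name `decode_hidden_map` and the sample payload are made up for illustration, because the real value above is truncated, and the real field may also need URL-unquoting first.

```python
import base64
import gzip


def decode_hidden_map(value: str) -> bytes:
    """Base64-decode and gunzip a form value (hypothetical helper).

    Raises an exception if the value is not base64-encoded gzip data.
    """
    # Restore base64 padding in case it was stripped from the form value.
    padded = value + "=" * (-len(value) % 4)
    return gzip.decompress(base64.b64decode(padded))


# Round trip on made-up data; this is NOT the payload from the post.
sample = base64.b64encode(gzip.compress(b"page=2&sort=asc")).decode("ascii")
print(sample[:4])                  # gzip header bytes 1f 8b 08 encode to "H4sI"
print(decode_hidden_map(sample))   # the original made-up bytes
```

Whatever comes out may still be in an application-specific format, so this only helps with understanding the request, not necessarily with constructing a new one.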
I am trying to build my spider on the CrawlSpider class, but I can't figure out how to code it. With BaseSpider I used the parse() method to process the first URL, which happens to be a login form, and did a POST there with:
def logon(self, response):
    login_form_data={ 'email': '[email protected]', 'password': 'mypass22', 'action': 'sign-in' }
    return [FormRequest.from_response(response, formnumber=0, formdata=login_form_data, callback=self.submit_next)]
And then I defined submit_next() to say what to do next. How do I tell CrawlSpider which method to use on the first URL?
All requests in my crawl, except the first one, are POST requests. They alternate between two types: pasting some data, and clicking "Next" to go to the next page.
The actual methodology will be as follows:
All of this has to be streamlined with the server response mechanism, e.g.:
dont_click=True in FormRequest.from_response
Now, how to figure it all out: use a web debugger such as Fiddler or the Firefox plugin Firebug, or simply hit F12 in IE 9, and check that the requests a user actually makes on the website match the way you are crawling the page.
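Scrapy does not execute JavaScript on its own, so the spider has to rebuild the POST that `gotoPage('2')` would trigger. Here is a rough sketch of that step using only the standard library: it reads the page number from the `onclick` attribute and collects the hidden form fields. It assumes that `encoded_session_hidden_map` is a hidden input in the form and that the page number travels in a field called `page`. Both are guesses that should be checked against the request seen in the debugger.

```python
import re
from html.parser import HTMLParser

# Matches the page number inside onclick="return gotoPage('2');"
GOTO_PAGE = re.compile(r"gotoPage\('(\d+)'\)")


class PageFormParser(HTMLParser):
    """Collects hidden inputs and the page number behind the Next link."""

    def __init__(self):
        super().__init__()
        self.hidden = {}
        self.next_page = None
        self._candidate = None  # page number of the <a> currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and attrs.get("name"):
            self.hidden[attrs["name"]] = attrs.get("value") or ""
        elif tag == "a":
            match = GOTO_PAGE.search(attrs.get("onclick") or "")
            self._candidate = match.group(1) if match else None

    def handle_data(self, data):
        # Only accept the link whose visible text is "Next".
        if self._candidate and data.strip() == "Next":
            self.next_page = self._candidate

    def handle_endtag(self, tag):
        if tag == "a":
            self._candidate = None


def next_page_formdata(html, page_field="page"):
    """Return form data for the next page, or None if there is no Next link.

    page_field is a hypothetical field name; check the real POST body.
    """
    parser = PageFormParser()
    parser.feed(html)
    if parser.next_page is None:
        return None
    formdata = dict(parser.hidden)
    formdata[page_field] = parser.next_page
    return formdata


# Made-up page fragment; the hidden value is a placeholder.
sample_html = """
<form method="post" action="/results">
  <input type="hidden" name="encoded_session_hidden_map" value="PLACEHOLDER">
  <a href="#" onclick="return gotoPage('2');"> Next </a>
</form>
"""
print(next_page_formdata(sample_html))
```

In a spider callback, the resulting dict could be passed as `formdata` to `FormRequest.from_response(response, formdata=..., dont_click=True, callback=...)`, which submits the form without simulating a click on any submit control.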