I am trying to scrape search results from a website that uses a __doPostBack function. The page displays 10 results per search query; to see more results, one has to click a button that triggers a __doPostBack JavaScript call. After some research, I realized that the POST request behaves just like a form, and that one can simply use Scrapy's FormRequest to fill it in. I used the following thread:
Troubles using scrapy with javascript __doPostBack method
to write the following script.
# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import FormRequest
from scrapy.http import Request
from scrapy.selector import Selector
from ahram.items import AhramItem
import re

class MySpider(CrawlSpider):
    name = u"el_ahram2"

    def start_requests(self):
        search_term = u'اقتصاد'
        baseUrl = u'http://digital.ahram.org.eg/sresult.aspx?srch=' + search_term + u'&archid=1'
        requests = []
        for i in range(1, 4):  # crawl first 3 pages as a test
            argument = u"'Page$" + str(i + 1) + u"'"
            data = {'__EVENTTARGET': u"'GridView1'", '__EVENTARGUMENT': argument}
            currentPage = FormRequest(baseUrl, formdata=data, callback=self.fetch_articles)
            requests.append(currentPage)
        return requests

    def fetch_articles(self, response):
        sel = Selector(response)
        for ref in sel.xpath("//a[contains(@href,'checkpart.aspx?Serial=')]/@href").extract():
            yield Request('http://digital.ahram.org.eg/' + ref, callback=self.parse_items)

    def parse_items(self, response):
        sel = Selector(response)
        the_title = ' '.join(sel.xpath("//title/text()").extract()).replace('\n', '').replace('\r', '').replace('\t', '')
        the_authors = '---'.join(sel.xpath("//*[contains(@id,'editorsdatalst_HyperLink')]//text()").extract())  # '//*' matches any element
        the_text = ' '.join(sel.xpath("//span[@id='TextBox2']/text()").extract())
        the_month_year = ' '.join(sel.xpath("string(//span[@id = 'Label1'])").extract())
        the_day = ' '.join(sel.xpath("string(//span[@id = 'Label2'])").extract())
        item = AhramItem()
        item["Authors"] = the_authors
        item["Title"] = the_title
        item["MonthYear"] = the_month_year
        item["Day"] = the_day
        item['Text'] = the_text
        return item
My problem now is that fetch_articles is never called:
2014-05-27 12:19:12+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-05-27 12:19:13+0200 [el_ahram2] DEBUG: Crawled (200) <POST http://digital.ahram.org.eg/sresult.aspx?srch=%D8%A7%D9%82%D8%AA%D8%B5%D8%A7%D8%AF&archid=1> (referer: None)
2014-05-27 12:19:13+0200 [el_ahram2] DEBUG: Crawled (200) <POST http://digital.ahram.org.eg/sresult.aspx?srch=%D8%A7%D9%82%D8%AA%D8%B5%D8%A7%D8%AF&archid=1> (referer: None)
2014-05-27 12:19:13+0200 [el_ahram2] DEBUG: Crawled (200) <POST http://digital.ahram.org.eg/sresult.aspx?srch=%D8%A7%D9%82%D8%AA%D8%B5%D8%A7%D8%AF&archid=1> (referer: None)
2014-05-27 12:19:13+0200 [el_ahram2] INFO: Closing spider (finished)
After searching for several days I feel completely stuck. I am a beginner in Python, so perhaps the error is trivial; if it is not, this thread could be of use to a number of people. Thank you in advance for your help.
Your code is fine; fetch_articles is running. You can test this by adding a print statement.
However, the website requires you to validate POST requests. To validate them, your request body must include __EVENTVALIDATION and __VIEWSTATE to prove you are responding to their form. To get these, you first need to make a GET request and extract the fields from the form. If you don't provide them, you get an error page instead, which does not contain any links with "checkpart.aspx?Serial=", so your for loop was never executed.
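If you want to see what the server actually returned (for example, that error page), Scrapy ships an open_in_browser helper that dumps a response to a temporary file and opens it in your browser. A minimal sketch of using it in the callback:

    from scrapy.utils.response import open_in_browser

    def fetch_articles(self, response):
        # Dump the response for inspection; a failed ASP.NET validation
        # shows up as an error page with no 'checkpart.aspx?Serial=' links.
        open_in_browser(response)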
Here is how I've set up start_requests; fetch_search now does what start_requests used to do.
class MySpider(CrawlSpider):
    name = u"el_ahram2"

    def start_requests(self):
        search_term = u'اقتصاد'
        baseUrl = u'http://digital.ahram.org.eg/sresult.aspx?srch=' + search_term + u'&archid=1'
        SearchPage = Request(baseUrl, callback=self.fetch_search)
        return [SearchPage]

    def fetch_search(self, response):
        sel = Selector(response)
        search_term = u'اقتصاد'
        baseUrl = u'http://digital.ahram.org.eg/sresult.aspx?srch=' + search_term + u'&archid=1'
        # Pull the hidden ASP.NET form fields out of the search page.
        viewstate = sel.xpath("//input[@id='__VIEWSTATE']/@value").extract().pop()
        eventvalidation = sel.xpath("//input[@id='__EVENTVALIDATION']/@value").extract().pop()
        for i in range(1, 4):  # crawl first 3 pages as a test
            argument = u"'Page$" + str(i + 1) + u"'"
            data = {'__EVENTTARGET': u"'GridView1'", '__EVENTARGUMENT': argument,
                    '__VIEWSTATE': viewstate, '__EVENTVALIDATION': eventvalidation}
            currentPage = FormRequest(baseUrl, formdata=data, callback=self.fetch_articles)
            yield currentPage

    ...

    def fetch_articles(self, response):
        sel = Selector(response)
        print response.body  # you can write this to a file and grep it
        for ref in sel.xpath("//a[contains(@href,'checkpart.aspx?Serial=')]/@href").extract():
            yield Request('http://digital.ahram.org.eg/' + ref, callback=self.parse_items)
I could not find the "checkpart.aspx?Serial=" links you are searching for, so this might not solve your issue, but I'm posting it as an answer rather than a comment for the sake of code formatting.
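As a side note, Scrapy can collect those hidden fields for you: FormRequest.from_response reads the <form> element in the response and pre-fills every input it finds, including hidden ones such as __VIEWSTATE and __EVENTVALIDATION. With that, fetch_search could be shortened along these lines (a sketch, keeping your original field values; I have not run it against this site):

    def fetch_search(self, response):
        for i in range(1, 4):  # crawl first 3 pages as a test
            yield FormRequest.from_response(
                response,
                formdata={'__EVENTTARGET': u"'GridView1'",
                          '__EVENTARGUMENT': u"'Page$" + str(i + 1) + u"'"},
                dont_click=True,  # trigger the postback without "clicking" a submit button
                callback=self.fetch_articles)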