I'm going to ask a rather big favor, as I have been struggling with this problem for several days. I have tried every way I could think of and still have no result; I am doing something wrong, but I still can't figure out what it is. So thank you to everyone who is willing enough to go on this adventure.

First things first: I am trying to use the POST method to submit the search form on delta.com. As always with these websites it is complicated, since they rely on sessions, cookies, and Javascript, so the problem could lie there. I started from a code example I found on Stack Overflow: Using MultipartPostHandler to POST form-data with Python. And here is my code, tweaked for the delta web page:
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest, Request
from delta.items import DeltaItem
from scrapy.contrib.spiders import CrawlSpider, Rule

class DmozSpider(CrawlSpider):
    name = "delta"
    allowed_domains = ["http://www.delta.com"]
    start_urls = ["http://www.delta.com"]

    def start_requests(self, response):
        yield FormRequest.from_response(response,
                                        formname='flightSearchForm',
                                        url="http://www.delta.com/booking/findFlights.do",
                                        formdata={'departureCity[0]': 'JFK',
                                                  'destinationCity[0]': 'SFO',
                                                  'departureDate[0]': '07.20.2013',
                                                  'departureDate[1]': '07.28.2013',
                                                  'paxCount': '1'},
                                        callback=self.parse1)

    def parse1(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//')
        items = []
        for site in sites:
            item = DeltaItem()
            item['title'] = site.select('text()').extract()
            item['link'] = site.select('text()').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items
When I tell the spider to crawl from the terminal, I see:
scrapy crawl delta -o items.xml -t xml
2013-07-01 13:39:30+0300 [scrapy] INFO: Scrapy 0.16.2 started (bot: delta)
2013-07-01 13:39:30+0300 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-01 13:39:30+0300 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-01 13:39:30+0300 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-01 13:39:30+0300 [scrapy] DEBUG: Enabled item pipelines:
2013-07-01 13:39:30+0300 [delta] INFO: Spider opened
2013-07-01 13:39:30+0300 [delta] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-01 13:39:30+0300 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-01 13:39:30+0300 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-01 13:39:33+0300 [delta] DEBUG: Crawled (200) <GET http://www.delta.com> (referer: None)
2013-07-01 13:39:33+0300 [delta] INFO: Closing spider (finished)
2013-07-01 13:39:33+0300 [delta] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 219,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 27842,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 7, 1, 10, 39, 33, 159235),
'log_count/DEBUG': 7,
'log_count/INFO': 4,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2013, 7, 1, 10, 39, 30, 734090)}
2013-07-01 13:39:33+0300 [delta] INFO: Spider closed (finished)
If you compare this with the example from the link above, I can't see that I ever managed to issue a POST request, even though I am using almost the same code. I even tried with a very simple HTML/PHP form from W3Schools that I placed on a server, but got the same result there: whatever I did, I never managed to create a POST. I think the problem is simple, but the only Python knowledge I have is Scrapy, and all the Scrapy I know comes from what I found online (it is well documented) and from examples, but it is still not enough for me. So if anyone could at least show me the right way, it would be a very big help.
Here's a working example of using FormRequest.from_response for delta.com:
from scrapy.item import Item, Field
from scrapy.http import FormRequest
from scrapy.spider import BaseSpider

class DeltaItem(Item):
    title = Field()
    link = Field()
    desc = Field()

class DmozSpider(BaseSpider):
    name = "delta"
    allowed_domains = ["delta.com"]
    start_urls = ["http://www.delta.com"]

    def parse(self, response):
        yield FormRequest.from_response(response,
                                        formname='flightSearchForm',
                                        formdata={'departureCity[0]': 'JFK',
                                                  'destinationCity[0]': 'SFO',
                                                  'departureDate[0]': '07.20.2013',
                                                  'departureDate[1]': '07.28.2013'},
                                        callback=self.parse1)

    def parse1(self, response):
        print response.status
You've used the wrong spider methods, plus allowed_domains was incorrectly set.
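As a side note: for the simple W3Schools-style form you tested with, you don't even need from_response - FormRequest can issue a POST on its own. A minimal sketch (the URL and field names here are placeholders, not taken from your form):

from scrapy.http import FormRequest
from scrapy.spider import BaseSpider

class PostTestSpider(BaseSpider):
    name = "posttest"

    def start_requests(self):
        # Passing formdata makes FormRequest urlencode it and send it
        # as a POST body. URL and field names below are placeholders -
        # adjust them to match your actual form.
        yield FormRequest("http://example.com/action_page.php",
                          formdata={'firstname': 'John', 'lastname': 'Doe'},
                          callback=self.after_post)

    def after_post(self, response):
        # If this prints 200 and POST, the POST request went through.
        print response.status, response.request.method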
But, anyway, delta.com heavily uses dynamic ajax calls for loading the content - and here's where your problems start. E.g. the response in the parse1 method doesn't contain any search results; instead it contains the html for the "AWAY WE GO. ARRIVING AT YOUR FLIGHTS SOON" loading page, where the results are loaded dynamically.
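You can see this for yourself by opening what parse1 actually received in a browser, e.g. with scrapy's open_in_browser debugging helper (a quick sketch):

from scrapy.utils.response import open_in_browser

def parse1(self, response):
    # Opens the html exactly as scrapy received it in your default
    # browser; for delta.com you'll see the loading page, not results.
    open_in_browser(response)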
Basically, you should work with your browser's developer tools and try to simulate those ajax calls inside your spider, or use a tool like selenium, which drives a real browser (and you can combine it with scrapy).
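For the selenium route, here's a rough sketch of the combination (untested against delta.com - the form-filling steps and XPaths are assumptions you'd need to work out in your browser's developer tools):

from selenium import webdriver
from scrapy.spider import BaseSpider

class DeltaSeleniumSpider(BaseSpider):
    name = "delta-selenium"
    allowed_domains = ["delta.com"]
    start_urls = ["http://www.delta.com"]

    def __init__(self, *args, **kwargs):
        super(DeltaSeleniumSpider, self).__init__(*args, **kwargs)
        # A real browser that will execute delta.com's javascript.
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # Let the browser load the page and run the ajax calls.
        self.driver.get(response.url)
        # ... fill in and submit the search form via self.driver here,
        # then wait for the results to render before extracting them.
        for link in self.driver.find_elements_by_xpath('//a'):
            print link.get_attribute('href')  # placeholder extraction

    def closed(self, reason):
        # Spider-closed hook: shut the browser down when the crawl ends.
        self.driver.quit()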
Hope that helps.