Python Scrapy: can't POST information to forms

I know I'm asking a big favor, as I've been struggling with this problem for several days. I've tried every way I know of, still with no result; I'm doing something wrong, but I can't figure out what, so thank you to everyone willing to join this adventure. First things first: I am trying to POST information to the search form on delta.com. As always with these websites it is complicated, since they rely on sessions, cookies and JavaScript, so the problem may lie there. I started from a code example I found on Stack Overflow, Using MultipartPostHandler to POST form-data with Python, and here is my code, tweaked for the Delta web page:

from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest, Request
from delta.items import DeltaItem
from scrapy.contrib.spiders import CrawlSpider, Rule


class DmozSpider(CrawlSpider):
    name = "delta"
    allowed_domains = ["http://www.delta.com"]
    start_urls = ["http://www.delta.com"]

    def start_requests(self, response):
        yield FormRequest.from_response(response,
                                        formname='flightSearchForm',
                                        url="http://www.delta.com/booking/findFlights.do",
                                        formdata={'departureCity[0]': 'JFK',
                                                  'destinationCity[0]': 'SFO',
                                                  'departureDate[0]': '07.20.2013',
                                                  'departureDate[1]': '07.28.2013',
                                                  'paxCount': '1'},
                                        callback=self.parse1)

    def parse1(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//')
        items = []
        for site in sites:
            item = DeltaItem()
            item['title'] = site.select('text()').extract()
            item['link'] = site.select('text()').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items

When I run the spider from the terminal, I see:

 scrapy crawl delta -o items.xml -t xml

2013-07-01 13:39:30+0300 [scrapy] INFO: Scrapy 0.16.2 started (bot: delta)
2013-07-01 13:39:30+0300 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-01 13:39:30+0300 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-01 13:39:30+0300 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-01 13:39:30+0300 [scrapy] DEBUG: Enabled item pipelines: 
2013-07-01 13:39:30+0300 [delta] INFO: Spider opened
2013-07-01 13:39:30+0300 [delta] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-01 13:39:30+0300 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-01 13:39:30+0300 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-01 13:39:33+0300 [delta] DEBUG: Crawled (200) <GET http://www.delta.com> (referer: None)
2013-07-01 13:39:33+0300 [delta] INFO: Closing spider (finished)
2013-07-01 13:39:33+0300 [delta] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 219,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 27842,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2013, 7, 1, 10, 39, 33, 159235),
     'log_count/DEBUG': 7,
     'log_count/INFO': 4,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2013, 7, 1, 10, 39, 30, 734090)}
2013-07-01 13:39:33+0300 [delta] INFO: Spider closed (finished)

If you compare this with the example from the link, I can't see that I ever managed to issue a POST request, even though I am using almost the same code: the log above shows only a single GET. I even tried a very simple HTML/PHP form from W3Schools that I placed on my own server, but got the same result there; whatever I did, I never managed to produce a POST. I think the problem is simple, but the only Python I know is Scrapy, and all the Scrapy I know is what I found online (it is well documented) and from examples, and that is still not enough for me. So if anyone could at least point me in the right direction, it would be a very big help.
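For reference, a minimal Scrapy spider that POSTs to a plain HTML form looks like the sketch below (the URL and field name are placeholders, assuming a page with a single <form method="post">); a successful submission shows up in the log as Crawled (200) <POST ...> rather than a GET:

from scrapy.http import FormRequest
from scrapy.spider import BaseSpider


class TestFormSpider(BaseSpider):
    name = "testform"
    # placeholder URL: any page containing a plain <form method="post">
    start_urls = ["http://example.com/form.html"]

    def parse(self, response):
        # parse() is called with the downloaded page, and from_response()
        # builds a POST request from the <form> it finds there
        yield FormRequest.from_response(response,
                                        formdata={'field1': 'value1'},
                                        callback=self.after_post)

    def after_post(self, response):
        self.log("form POST returned status %s" % response.status)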

asked Jul 01 '13 by Vy.Iv


1 Answer

Here's a working example of using FormRequest.from_response() for delta.com:

from scrapy.item import Item, Field
from scrapy.http import FormRequest
from scrapy.spider import BaseSpider


class DeltaItem(Item):
    title = Field()
    link = Field()
    desc = Field()


class DmozSpider(BaseSpider):
    name = "delta"
    allowed_domains = ["delta.com"]
    start_urls = ["http://www.delta.com"]

    def parse(self, response):
        # parse() receives the downloaded start page, so from_response()
        # can read the search form (action URL, hidden fields) out of it
        yield FormRequest.from_response(response,
                                        formname='flightSearchForm',
                                        formdata={'departureCity[0]': 'JFK',
                                                  'destinationCity[0]': 'SFO',
                                                  'departureDate[0]': '07.20.2013',
                                                  'departureDate[1]': '07.28.2013'},
                                        callback=self.parse1)

    def parse1(self, response):
        # a 200 here confirms the POST itself went through
        print response.status

You've used the spider methods incorrectly: start_requests() takes no response argument, so form submission belongs in a callback such as parse(). Also, allowed_domains was set incorrectly: it takes bare domain names, not URLs with a scheme.
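For reference, start_requests() runs before anything has been downloaded, so it cannot build a FormRequest from a page; a minimal sketch of how it is actually shaped:

from scrapy.http import Request
from scrapy.spider import BaseSpider


class ExampleSpider(BaseSpider):
    name = "example"
    allowed_domains = ["delta.com"]  # bare domain, no scheme

    def start_requests(self):
        # only self: this builds the very first requests of the crawl;
        # form submission has to wait until a callback has a response
        yield Request("http://www.delta.com", callback=self.parse)

    def parse(self, response):
        # here a response exists, so FormRequest.from_response() works
        pass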

But, in any case, delta.com relies heavily on dynamic AJAX calls for loading its content, and that is where your problems start. E.g. the response in the parse1 method doesn't contain any search results; instead it contains the HTML of an interstitial "AWAY WE GO. ARRIVING AT YOUR FLIGHTS SOON" page, from which the results are loaded dynamically.

Basically, you should work with your browser's developer tools and try to simulate those AJAX calls inside your spider, or use a tool like selenium, which drives a real browser (and you can combine it with scrapy); a sketch of the selenium route follows.
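Here is a rough sketch of that route, assuming selenium is installed and a Firefox driver is available; the element locators are placeholders you would replace after inspecting the page with the developer tools:

from selenium import webdriver
from scrapy.spider import BaseSpider


class DeltaSeleniumSpider(BaseSpider):
    name = "delta_selenium"
    allowed_domains = ["delta.com"]
    start_urls = ["http://www.delta.com"]

    def __init__(self, *args, **kwargs):
        super(DeltaSeleniumSpider, self).__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # let a real browser load the page and run its javascript
        self.driver.get(response.url)

        # placeholder locators - find the real ids/names with the
        # browser's developer tools before relying on them
        self.driver.find_element_by_name("departureCity[0]").send_keys("JFK")
        self.driver.find_element_by_name("destinationCity[0]").send_keys("SFO")
        self.driver.find_element_by_name("findFlightsSubmit").click()

        # page_source now holds the javascript-rendered results, which
        # you can feed back into scrapy's selectors for extraction
        self.log("rendered page length: %d" % len(self.driver.page_source))
        self.driver.quit()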

See also:

  • Scraping ajax pages using python
  • Can scrapy be used to scrape dynamic content from websites that are using AJAX?
  • Pagination using scrapy

Hope that helps.

answered Oct 18 '22 by alecxe