How can I scrape this website? How would I send a POST request with a payload and get data back from it?
With the code below I am able to scrape the first page, but how would I scrape the second page? Do I need to use Selenium, or is Scrapy enough for this?
import scrapy
from scrapy import log
from scrapy.http import Request

class myntra_spider(scrapy.Spider):
    name = "myntra"
    allowed_domains = []  # was misspelled "allowed_domain", which Scrapy ignores
    start_urls = ["http://www.myntra.com/men-footwear"]

    logfile = open('testlog.log', 'w')
    log_observer = log.ScrapyFileLogObserver(logfile, level=log.ERROR)
    log_observer.start()

    def parse(self, response):
        print "response url", response.url
        links = response.xpath("//ul[@class='results small']/li/a/@href").extract()
        print links
        yield Request('http://www.myntra.com/search-service/searchservice/search/filteredSearch',
                      callback=self.nextpages, body="")

    def nextpages(self, response):
        links = response.xpath("//ul[@class='results small']/li/a/@href").extract()
        for i in range(10):
            print "link", links[i]
Using FormRequest. You can use the FormRequest.from_response() method for this job. Here's the beginning of an example spider which uses it:

import scrapy

def authentication_failed(response):
    # TODO: Check the contents of the response and return True if it failed
    # or False if it succeeded.
Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and passed across the system until they reach the Downloader, which executes the request and returns a Response object that travels back to the spider that issued the request.
Making a request is a straightforward process in Scrapy. To generate a request, you need the URL of the webpage from which you want to extract useful data. You also need a callback function. The callback function is invoked when there is a response to the request.
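To make the request/callback flow above concrete, here is a toy, self-contained sketch of the idea. This is not Scrapy's real machinery; the `Request`, `Response`, `downloader`, and `crawl` names below are purely illustrative stand-ins showing how a callback attached to a request ends up receiving the response.

```python
# Toy illustration of the request/callback flow (NOT Scrapy internals;
# all names here are simplified stand-ins).

class Request:
    def __init__(self, url, callback):
        self.url = url          # where to fetch
        self.callback = callback  # who handles the response

class Response:
    def __init__(self, url, body):
        self.url = url
        self.body = body

def downloader(request):
    # Stand-in for the real Downloader: "fetches" the URL and wraps
    # the result in a Response object.
    return Response(request.url, body="<html>page content</html>")

def crawl(request):
    # The engine hands the Request to the downloader, then routes the
    # Response back to the callback the spider attached to the Request.
    response = downloader(request)
    return request.callback(response)

def parse(response):
    # The callback is invoked only once a response is available.
    return "parsed " + response.url

result = crawl(Request("http://www.myntra.com/men-footwear", callback=parse))
print(result)
```

In real Scrapy the same pairing happens when you write `Request(url, callback=self.nextpages)`: the engine downloads `url` and calls `self.nextpages(response)` for you.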
You do not need Selenium for this. Check the payload required to be sent along with the request in your browser and attach it with the request.
I tried it with your site; the following snippet works:
import json  # at the top of the spider module
from scrapy.http import Request

def start_requests(self):
    url = "http://www.myntra.com/search-service/searchservice/search/filteredSearch"
    payload = [{
        "query": "(global_attr_age_group:(\"Adults-Unisex\" OR \"Adults-Women\") AND global_attr_master_category:(\"Footwear\"))",
        "start": 0,
        "rows": 96,
        "facetField": [],
        "pivotFacets": [],
        "fq": ["count_options_availbale:[1 TO *]"],
        "sort": [
            {"sort_field": "count_options_availbale", "order_by": "desc"},
            {"sort_field": "score", "order_by": "desc"},
            {"sort_field": "style_store1_female_sort_field", "order_by": "desc"},
            {"sort_field": "potential_revenue_female_sort_field", "order_by": "desc"},
            {"sort_field": "global_attr_catalog_add_date", "order_by": "desc"}
        ],
        "return_docs": True,
        "colour_grouping": True,
        "useCache": True,
        "flatshot": False,
        "outOfStock": False,
        "showInactiveStyles": False,
        "facet": True
    }]
    yield Request(url, self.parse, method="POST", body=json.dumps(payload))

def parse(self, response):
    data = json.loads(response.body)
    print data
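To answer the second-page question: since the payload carries a "start" offset and a "rows" page size, later pages are most likely fetched by advancing "start" in steps of "rows". That pagination behavior is an assumption worth verifying against the site; the `build_payload` helper below is illustrative (the query string is trimmed to one field for brevity), not part of Scrapy or the site's API.

```python
import json

def build_payload(page, rows=96):
    # Illustrative helper: same payload shape as above, with "start"
    # advanced by the page size. Query trimmed here for brevity.
    return [{
        "query": '(global_attr_master_category:("Footwear"))',
        "start": page * rows,  # page 0 -> 0, page 1 -> 96, page 2 -> 192, ...
        "rows": rows,
        "return_docs": True,
    }]

# In the spider you would then yield one POST Request per page, e.g.:
# for page in range(5):
#     yield Request(url, self.parse, method="POST",
#                   body=json.dumps(build_payload(page)))
print(json.dumps(build_payload(1)[0]["start"]))
```

If the response reports a total result count, you can also compute the number of pages from it instead of hard-coding a range.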