class HomedepotcrawlSpider(scrapy.Spider):
    name = 'homeDepotCrawl'
    allowed_domains = ['homedepot.com']
    start_urls = ['https://www.homedepot.com/b/N-5yc1v/Ntk-ProductInfoMatch/Ntt-zline?NCNI-5&experienceName=default&Nao=0']

    def parse(self, response):
        # call Home Depot parsing function
        for item in self.parseHomeDepot(response):
            yield item

        nextPageSelect = '.hd-pagination__link'
        next_page = response.css(nextPageSelect).getall()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )
Edit:
I got .hd-pagination__link by using a CSS selector extension for Google Chrome and selecting the next-page icon (screenshot attached).
I've tried a few approaches, and this one made the most sense to me, but I think I'm just grabbing the wrong object for the next page. Right now my program only grabs the data from the first page, and the code block that should traverse pages seems to be ignored.
I found a pattern in the URLs where page numbers are denoted in increments of 24 (maybe because of the number of items per page?). For example:
Page 1: https://www.homedepot.com/b/N-5yc1v/Ntk-ProductInfoMatch/Ntt-zline?NCNI-5&experienceName=default&Nao=0
Page 2: https://www.homedepot.com/b/N-5yc1v/Ntk-ProductInfoMatch/Ntt-zline?NCNI-5&experienceName=default&Nao=24
Page 3: https://www.homedepot.com/b/N-5yc1v/Ntk-ProductInfoMatch/Ntt-zline?NCNI-5&experienceName=default&Nao=48
...
Page n: https://www.homedepot.com/b/N-5yc1v/Ntk-ProductInfoMatch/Ntt-zline?NCNI-5&experienceName=default&Nao=[(n*24) - 24]
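In other words (just the pattern above restated as code, not part of my spider):

# relationship between the 1-based page number and the Nao offset in the URLs above
def nao_offset(page_number):
    return (page_number - 1) * 24  # page 1 -> 0, page 2 -> 24, page 3 -> 48, ...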
When I tried testing code based on the page numbers (incrementing the number after Nao=x), I would just loop through the first page x times: my output was the first page (24 items) repeated x times.
I've also looked into CrawlSpider but couldn't really understand it or how to implement it.
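(For reference, this is roughly what I pieced together from the CrawlSpider docs; the restrict_css value is just the pagination class from my selector above, parse_page stands in for my parseHomeDepot logic, and I haven't gotten it working, so treat it as an untested sketch.)

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class HomedepotPaginationSpider(CrawlSpider):
    name = 'homeDepotPagination'
    allowed_domains = ['homedepot.com']
    start_urls = ['https://www.homedepot.com/b/N-5yc1v/Ntk-ProductInfoMatch/Ntt-zline?NCNI-5&experienceName=default&Nao=0']

    rules = (
        # follow every pagination link and run the callback on each result page
        Rule(LinkExtractor(restrict_css='.hd-pagination__link'), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # item extraction would go here (parseHomeDepot in my actual spider)
        pass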
Any help with my code/clarification on other methods would be appreciated!
Also, this is not my whole program; I'm leaving out my parseHomeDepot function because I don't think it's necessary. If that code is needed, just let me know!
It seems to me like you have a couple of issues.
First of all, you may be getting the whole HTML element that contains the link for the next page, whereas what you're looking for is the link only. So I suggest you use the CSS selector like so:
nextPageSelect = '.hd-pagination__link::attr(href)'
This will get you the href values instead of the whole HTML elements. I suggest reading further into CSS selectors.
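To illustrate the difference (assuming the page really uses that class), the two selectors return something like this in scrapy shell; the href values shown are made up:

response.css('.hd-pagination__link').getall()
# ['<a class="hd-pagination__link" href="/b/...&Nao=24">2</a>', ...]  full <a> elements
response.css('.hd-pagination__link::attr(href)').getall()
# ['/b/...&Nao=24', ...]                                              just the href attribute values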
Secondly, there is a logical issue in your code.
next_page = response.css(nextPageSelect).getall()
This piece of code gets you a list of all the 'next page' links on your current page, but you treat the whole list as one link (and pass it straight to response.urljoin). I suggest a for loop, something like this:
if next_pages:
    for page in next_pages:
        yield scrapy.Request(
            response.urljoin(page),
            callback=self.parse
        )
Moving on, to make better use of Scrapy's parallelism and concurrency features, you may want to return a list of scrapy.Request objects instead of yielding each request as you find it. So to summarize:
nextPageSelect = '.hd-pagination__link::attr(href)'
next_pages = response.css(nextPageSelect).getall()
requests = []
if next_pages:
    for page in next_pages:
        requests.append(scrapy.Request(
            response.urljoin(page),
            callback=self.parse
        ))
return requests
Good luck!
Here is some working code for what you want to do:
import scrapy
from urllib.parse import urlsplit, urljoin

class HomedepotSpider(scrapy.Spider):
    name = 'homedepot'
    start_urls = ['https://www.homedepot.com/b/N-5yc1v/Ntk-ProductInfoMatch/Ntt-zline?NCNI-5&experienceName=default&Nao=0']

    def parse(self, response):
        # Here you do something with your items

        next_page = response.css('a.hd-pagination__link[title=Next]::attr(href)').get()
        if next_page is not None:
            o = urlsplit(response.url)
            base_url = f'{o.scheme}://{o.netloc}'
            next_page_url = urljoin(base_url, next_page)
            yield response.follow(next_page_url, callback=self.parse)
The main thing I would point you to in this code is that it takes the base URL (scheme and host) from response.url using urlsplit, and then appends the relative link you are extracting to that with urljoin.
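Spelled out as a standalone example (the relative href below is made up to look like one of the pagination links):

from urllib.parse import urlsplit, urljoin

page_url = 'https://www.homedepot.com/b/N-5yc1v/Ntk-ProductInfoMatch/Ntt-zline?NCNI-5&experienceName=default&Nao=0'
relative_href = '/b/N-5yc1v/Ntk-ProductInfoMatch/Ntt-zline?NCNI-5&experienceName=default&Nao=24'  # illustrative

o = urlsplit(page_url)
base_url = f'{o.scheme}://{o.netloc}'       # 'https://www.homedepot.com'
print(urljoin(base_url, relative_href))     # 'https://www.homedepot.com/b/...&Nao=24'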
Also, here are the Scrapy logs showing it has crawled 31 pages. This is what you should get if you execute it:
2020-02-21 10:42:04 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 44799,
'downloader/request_count': 31,
'downloader/request_method_count/GET': 31,
'downloader/response_bytes': 1875031,
'downloader/response_count': 31,
'downloader/response_status_count/200': 31,
'dupefilter/filtered': 1,
'elapsed_time_seconds': 13.690273,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 2, 21, 10, 42, 4, 145686),
'log_count/DEBUG': 32,
'log_count/INFO': 10,
'memusage/max': 52195328,
'memusage/startup': 52195328,
'request_depth_max': 31,
'response_received_count': 31,
'scheduler/dequeued': 31,
'scheduler/dequeued/memory': 31,
'scheduler/enqueued': 31,
'scheduler/enqueued/memory': 31,
'start_time': datetime.datetime(2020, 2, 21, 10, 41, 50, 455413)}
2020-02-21 10:42:04 [scrapy.core.engine] INFO: Spider closed (finished)
I hope this helps!!
Try this approach:
Get the current page number, use it as a reference to find the next page's number, and then build the next URL from that page number and the 24-item offset:
import re

try:
    # title of the pagination link immediately after the active one, e.g. "2"
    nextpage_number = response.xpath("//ul[contains(@class,'hd-pagination')]/li/a[contains(@class,'active ')]/ancestor::li[1]/following-sibling::li[1]/a/@title")[0].extract()
    # everything up to and including "Nao=" in the current URL
    current_url_strip = re.search(r"(.+Nao=)\d+", response.url)
    # page n starts at offset (n - 1) * 24, per the URL pattern in the question
    new_url = "%s%s" % (current_url_strip.group(1), (int(nextpage_number) - 1) * 24)
    yield scrapy.Request(new_url, meta=response.meta)
except:
    pass
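As a quick check of the string handling (assuming the next page's title is "2" and starting from the page-1 URL in the question):

import re

url = "https://www.homedepot.com/b/N-5yc1v/Ntk-ProductInfoMatch/Ntt-zline?NCNI-5&experienceName=default&Nao=0"
prefix = re.search(r"(.+Nao=)\d+", url).group(1)
print(prefix + str((int("2") - 1) * 24))
# https://www.homedepot.com/b/N-5yc1v/Ntk-ProductInfoMatch/Ntt-zline?NCNI-5&experienceName=default&Nao=24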