Scrapy - Extract Data from multiple pages

import scrapy


class HomedepotcrawlSpider(scrapy.Spider):
    name = 'homeDepotCrawl'
    allowed_domains = ['homedepot.com']
    start_urls = ['https://www.homedepot.com/b/N-5yc1v/Ntk-ProductInfoMatch/Ntt-zline?NCNI-5&experienceName=default&Nao=0']

    def parse(self, response):

        # call home depot function
        for item in self.parseHomeDepot(response):
            yield item

        nextPageSelect = '.hd-pagination__link'
        next_page = response.css(nextPageSelect).getall()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )

Edit:

The way I got .hd-pagination__link was by using a CSS selector extension for Google Chrome and selecting the next page icon (screenshot attached).

[Screenshot of the CSS selector for the next page button]

I've tried a few things, and this is the approach that made the most sense to me; I think I'm just grabbing the wrong object for the next page. As of right now, my program only grabs the data from the first page, and the code block that should traverse pages seems to be ignored.

I found a pattern in the URL where page numbers are denoted in increments of 24 (maybe due to the number of items per page?). For example:

Page 1: https://www.homedepot.com/b/N-5yc1v/Ntk-ProductInfoMatch/Ntt-zline?NCNI-5&experienceName=default&Nao=0

Page 2: https://www.homedepot.com/b/N-5yc1v/Ntk-ProductInfoMatch/Ntt-zline?NCNI-5&experienceName=default&Nao=24

Page 3: https://www.homedepot.com/b/N-5yc1v/Ntk-ProductInfoMatch/Ntt-zline?NCNI-5&experienceName=default&Nao=48

...

Page n: https://www.homedepot.com/b/N-5yc1v/Ntk-ProductInfoMatch/Ntt-zline?NCNI-5&experienceName=default&Nao=[(n*24) - 24]
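In other words, the offset for page n is (n - 1) * 24, so the page URLs could presumably be generated like this (just an illustration of the pattern, not my actual spider code):

# Hypothetical sketch: page 1 -> Nao=0, page 2 -> Nao=24, page 3 -> Nao=48
base = ('https://www.homedepot.com/b/N-5yc1v/Ntk-ProductInfoMatch/'
        'Ntt-zline?NCNI-5&experienceName=default&Nao=')
page_urls = [base + str((n - 1) * 24) for n in range(1, 4)]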

When I tried testing out code related to page numbers (incrementing the number after Nao=x), I would just loop through the first page x times; my output was the first page (24 items) repeated x times.

I've also looked into CrawlSpider, but I couldn't really understand it or how to implement it.

Any help with my code, or clarification on other methods, would be appreciated!

Also, this is not my whole program; I'm leaving out my parseHomeDepot function because I don't think it's necessary. If that code is needed, just let me know!

asked Feb 13 '20 by chrisHG

3 Answers

It seems to me like you have a couple of issues.

First of all, you may be getting the whole HTML element that contains the link for the next page, whereas what you're looking for is just the link. So I suggest you use a CSS selector like so:

nextPageSelect = '.hd-pagination__link::attr(href)'

This will get you the links instead of the whole HTML elements. I suggest looking further into CSS selectors in the Scrapy selectors documentation.
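As a quick illustration (the exact markup is assumed here, but this is the general shape of what each selector returns):

# Without ::attr(href): serialized HTML elements
response.css('.hd-pagination__link').getall()
# e.g. ['<a class="hd-pagination__link" href="...">...</a>', ...]

# With ::attr(href): just the href values
response.css('.hd-pagination__link::attr(href)').getall()
# e.g. ['/b/...&Nao=24', '/b/...&Nao=48', ...]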

Secondly, there seems to be a logical issue in your code.

next_page = response.css(nextPageSelect).getall()

This piece of code gets you a list of all the 'next page' links on your current page, but you treat the whole list as one link. I suggest a for loop, something like this:

if next_pages:
    for page in next_pages:
        yield scrapy.Request(
            response.urljoin(page),
            callback=self.parse
        )

Now moving on: to make better use of Scrapy's parallelism and concurrency features, you may want to return a list of scrapy.Request objects instead of yielding each one as you find it. Scrapy's built-in duplicate filter will drop URLs that have already been requested, so yielding every pagination link on every page won't send the spider in circles. So to summarize:

nextPageSelect = '.hd-pagination__link::attr(href)'
next_pages = response.css(nextPageSelect).getall()
requests = []
if next_pages:
    for page in next_pages:
        requests.append(scrapy.Request(
            response.urljoin(page),
            callback=self.parse
        ))
return requests

Good luck!

answered Oct 08 '22 by UzairAhmed


Here is some working code for what you want to do:

import scrapy
from urllib.parse import urlsplit, urljoin

class HomedepotSpider(scrapy.Spider):
    name = 'homedepot'
    start_urls = ['https://www.homedepot.com/b/N-5yc1v/Ntk-ProductInfoMatch/Ntt-zline?NCNI-5&experienceName=default&Nao=0']

    def parse(self, response):

        # Here you do something with your items

        next_page = response.css('a.hd-pagination__link[title=Next]::attr(href)').get()
        if next_page is not None:
            o = urlsplit(response.url)
            base_url = f'{o.scheme}://{o.netloc}'
            next_page_url = urljoin(base_url,next_page)
            yield response.follow(next_page_url, callback=self.parse)

The main things I would point you to in this code are:

  1. Check the selector for the next page. It checks the title attribute and only selects the element whose title is "Next"; that is how it identifies the last button in the pagination. I'm not sure if your example is identifying the right button.
  2. The next page link you get is a relative URL. The code uses urlsplit to get the base part of the current URL (response.url) and then appends the relative URL to it with urljoin.
  3. Once you have the URL of the next page, you can just use response.follow to tell the spider to add that URL, with your chosen callback, to the list of URLs to crawl (see the note after this list).
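As a side note, response.follow also accepts a relative URL directly and resolves it against response.url itself, so the urlsplit/urljoin steps could be dropped (a possible simplification, using the same selector):

next_page = response.css('a.hd-pagination__link[title=Next]::attr(href)').get()
if next_page is not None:
    # response.follow joins the relative href with response.url for us
    yield response.follow(next_page, callback=self.parse)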

Also, here are the Scrapy logs showing that it crawled 31 pages. This is what you should get if you execute it:

2020-02-21 10:42:04 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 44799,
 'downloader/request_count': 31,
 'downloader/request_method_count/GET': 31,
 'downloader/response_bytes': 1875031,
 'downloader/response_count': 31,
 'downloader/response_status_count/200': 31,
 'dupefilter/filtered': 1,
 'elapsed_time_seconds': 13.690273,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 2, 21, 10, 42, 4, 145686),
 'log_count/DEBUG': 32,
 'log_count/INFO': 10,
 'memusage/max': 52195328,
 'memusage/startup': 52195328,
 'request_depth_max': 31,
 'response_received_count': 31,
 'scheduler/dequeued': 31,
 'scheduler/dequeued/memory': 31,
 'scheduler/enqueued': 31,
 'scheduler/enqueued/memory': 31,
 'start_time': datetime.datetime(2020, 2, 21, 10, 41, 50, 455413)}
2020-02-21 10:42:04 [scrapy.core.engine] INFO: Spider closed (finished)

I hope this helps!!

answered Oct 08 '22 by Alvaro Aguilar


Try this approach:

Get the current page number, use it as a reference to find the next page's number, and then build the next page's URL by converting that number into the Nao offset:

import re

try:
    # Title of the pagination entry right after the active one, e.g. "2"
    nextpage_number = response.xpath("//ul[contains(@class,'hd-pagination')]/li/a[contains(@class,'active ')]/ancestor::li[1]/following-sibling::li[1]/a/@title")[0].extract()
    # Keep everything up to and including "Nao=" from the current URL
    current_url_strip = re.search(r"(.+Nao=)\d+", response.url)
    # Page n lives at offset (n - 1) * 24, per the pattern in the question
    new_url = "%s%d" % (current_url_strip.group(1), (int(nextpage_number) - 1) * 24)
    yield scrapy.Request(new_url, meta=response.meta)
except (IndexError, AttributeError, ValueError):
    pass
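If the regex feels brittle, the same offset swap could also be done with the standard library's URL tools (a sketch with a hypothetical helper; it assumes the Nao parameter behaves as described in the question):

from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

def with_nao(url, offset):
    # Replace the Nao query parameter with the given offset, keeping the rest intact.
    parts = urlsplit(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    query['Nao'] = [str(offset)]
    # Note: bare parameters like NCNI-5 are re-encoded as "NCNI-5=".
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))

For example, with_nao(response.url, 24) gives the page 2 URL.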
answered Oct 08 '22 by Janib Soomro