I'm writing a Scrapy CrawlSpider that reads a list of ads on the first page, takes some info such as the listings' thumbnails and the ad URLs, and then yields a request to each of those ad URLs to scrape their details.
It was working and apparently paginating well in the test environment, but today, when I tried a complete run, I saw this in the log:
Crawled 3852 pages (at 228 pages/min), scraped 256 items (at 15 items/min)
I don't understand the reason for this big difference between crawled pages and scraped items. Can anybody help me figure out where those items are getting lost?
My spider code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
# MyItem and MyItemLoader are defined in the project's items module


class MySpider(CrawlSpider):
    name = "myspider"
    allowed_domains = ["myspider.com", "myspider.co"]
    start_urls = [
        "http://www.myspider.com/offers/myCity/typeOfAd/?search=fast",
    ]

    # Pagination
    rules = (
        Rule(SgmlLinkExtractor(), callback='parse_start_url', follow=True),
    )

    # 1st page
    def parse_start_url(self, response):
        hxs = HtmlXPathSelector(response)
        next_page = hxs.select("//a[@class='pagNext']/@href").extract()
        offers = hxs.select("//div[@class='hlist']")
        for offer in offers:
            item = MyItem()
            item['url'] = offer.select('.//span[@class="location"]/a/@href').extract()[0]
            item['thumb'] = offer.select('.//div[@class="itemFoto"]/div/a/img/@src').extract()[0]
            request = Request(item['url'], callback=self.second_page)
            request.meta['myItem'] = item
            yield request
        if next_page:
            yield Request(next_page[0], callback=self.parse_start_url)

    def second_page(self, response):
        item = response.meta['myItem']
        loader = MyItemLoader(item=item, response=response)
        loader.add_xpath('address', '//span[@itemprop="streetAddress"]/text()')
        return loader.load_item()
Let's say you go to your first start_urls URL (actually you only have one) and on this page there is only one anchor link (<a>). So your spider crawls the href URL in this link and you get control in your callback, parse_start_url. Inside that page you have 5000 divs with an hlist class. And let's suppose all 5000 of these subsequent detail URLs returned 404, not found.
In this case you would have: thousands of crawled pages (the start page, the page it links to, and all 5000 detail requests) but zero scraped items, because the 404 responses never make it to your second_page callback.
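One way to make this failure mode visible is to let 404 responses through to your callback and log them explicitly. This is a minimal sketch, not code from the question: DiagnosticSpider is a hypothetical stripped-down stand-in that keeps only the detail-request part, and only the handle_httpstatus_list attribute and the status check are additions.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request


class DiagnosticSpider(CrawlSpider):
    """Hypothetical stand-in for MySpider that makes lost detail pages visible."""
    name = "diagnostic"
    allowed_domains = ["myspider.com", "myspider.co"]
    start_urls = ["http://www.myspider.com/offers/myCity/typeOfAd/?search=fast"]
    rules = (
        Rule(SgmlLinkExtractor(), callback='parse_start_url', follow=True),
    )

    # Tell the HttpError middleware to pass 404 responses through to the
    # callback instead of silently dropping them.
    handle_httpstatus_list = [404]

    def parse_start_url(self, response):
        hxs = HtmlXPathSelector(response)
        for url in hxs.select("//div[@class='hlist']"
                              "//span[@class='location']/a/@href").extract():
            yield Request(url, callback=self.second_page)

    def second_page(self, response):
        if response.status == 404:
            # This is exactly where items "disappear" in the scenario above.
            self.log("Detail page returned 404: %s" % response.url)
            return
        self.log("Detail page OK: %s" % response.url)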
Let's take another example: on your start URL page you have 5000 anchors, but none (as in zero) of those pages have any divs with a class attribute of hlist.
In this case you would again have: thousands of crawled pages but zero scraped items, because pages without hlist divs never produce an item or a detail request.
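To see which crawled pages fall into this second bucket, you could log a line whenever the listing selector matches nothing. A sketch of how parse_start_url might start if dropped into the spider from the question (item building and pagination are omitted for brevity; only the empty check and log call are additions):

def parse_start_url(self, response):
    hxs = HtmlXPathSelector(response)
    offers = hxs.select("//div[@class='hlist']")
    if not offers:
        # The page was crawled fine; it just contains no ad blocks,
        # so it can never contribute a scraped item.
        self.log("No 'hlist' divs found on %s" % response.url)
    for offer in offers:
        url = offer.select('.//span[@class="location"]/a/@href').extract()
        if url:
            yield Request(url[0], callback=self.second_page)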
Your answer lies in the DEBUG log output.
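If scrolling through thousands of DEBUG lines is impractical, you could write the log to a file (for example with scrapy crawl myspider -s LOG_FILE=crawl.log) and tally the interesting lines afterwards. A rough sketch; the exact wording of the log messages varies a little between Scrapy versions, so check the patterns against your own log:

# Quick-and-dirty tally of a Scrapy log file written to crawl.log.
from collections import Counter

counts = Counter()
with open("crawl.log") as log:
    for line in log:
        if "Crawled (200)" in line:
            counts["crawled 200"] += 1
        elif "Crawled (404)" in line:
            counts["crawled 404"] += 1
        if "Scraped from" in line:
            counts["scraped items"] += 1

print(counts)

Comparing the crawled counts per status against the scraped count should tell you which of the two scenarios above (or a mix of both) you are hitting.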