I'm writing a Scrapy CrawlSpider that reads a list of ads on the first page, takes some info such as the listings' thumbnails and the ad URLs, and then yields a request to each of those ad URLs to scrape their details.
It was working and apparently paginating well in the test environment, but today, when I tried a complete run, I saw this in the log:
Crawled 3852 pages (at 228 pages/min), scraped 256 items (at 15 items/min)
I don't understand the reason for this big difference between crawled pages and scraped items. Can anybody help me figure out where those items are getting lost?
My spider code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
# MyItem and MyItemLoader are defined in the project's items module


class MySpider(CrawlSpider):
    name = "myspider"
    allowed_domains = ["myspider.com", "myspider.co"]
    start_urls = [
        "http://www.myspider.com/offers/myCity/typeOfAd/?search=fast",
    ]

    # Pagination
    rules = (
        Rule(SgmlLinkExtractor(), callback='parse_start_url', follow=True),
    )

    # 1st page
    def parse_start_url(self, response):
        hxs = HtmlXPathSelector(response)
        next_page = hxs.select("//a[@class='pagNext']/@href").extract()
        offers = hxs.select("//div[@class='hlist']")
        for offer in offers:
            item = MyItem()
            item['url'] = offer.select('.//span[@class="location"]/a/@href').extract()[0]
            item['thumb'] = offer.select('.//div[@class="itemFoto"]/div/a/img/@src').extract()[0]
            request = Request(item['url'], callback=self.second_page)
            request.meta['myItem'] = item
            yield request
        if next_page:
            yield Request(next_page[0], callback=self.parse_start_url)

    def second_page(self, response):
        item = response.meta['myItem']
        loader = MyItemLoader(item=item, response=response)
        loader.add_xpath('address', '//span[@itemprop="streetAddress"]/text()')
        return loader.load_item()
Let's say you go to your first start_urls URL (actually you only have one) and on this page there is only one anchor link (<a>). So your spider crawls the href URL in this link and you get control in your callback, parse_start_url. Inside that page you have 5000 divs with an hlist class. And let's suppose all 5000 of these subsequent detail URLs returned 404, not found.
In this case you would have: thousands of crawled pages (the start page, the page it links to, and all 5000 detail requests) but zero scraped items, because the 404 responses never make it to your second_page callback.
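One way to make this failure mode visible is to let 404 responses through to your callback and log them explicitly. This is a minimal sketch, not code from the question: DiagnosticSpider is a hypothetical stripped-down stand-in that keeps only the detail-request part, and only the handle_httpstatus_list attribute and the status check are additions.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request


class DiagnosticSpider(CrawlSpider):
    """Hypothetical stand-in for MySpider that makes lost detail pages visible."""
    name = "diagnostic"
    allowed_domains = ["myspider.com", "myspider.co"]
    start_urls = ["http://www.myspider.com/offers/myCity/typeOfAd/?search=fast"]
    rules = (
        Rule(SgmlLinkExtractor(), callback='parse_start_url', follow=True),
    )

    # Tell the HttpError middleware to pass 404 responses through to the
    # callback instead of silently dropping them.
    handle_httpstatus_list = [404]

    def parse_start_url(self, response):
        hxs = HtmlXPathSelector(response)
        for url in hxs.select("//div[@class='hlist']"
                              "//span[@class='location']/a/@href").extract():
            yield Request(url, callback=self.second_page)

    def second_page(self, response):
        if response.status == 404:
            # This is exactly where items "disappear" in the scenario above.
            self.log("Detail page returned 404: %s" % response.url)
            return
        self.log("Detail page OK: %s" % response.url)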
Let's take another example: on your start URL page you have 5000 anchors, but none (as in zero) of those pages have any divs with a class attribute of hlist.
In this case you would again have: thousands of crawled pages but zero scraped items, because pages without hlist divs never produce an item or a detail request.
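To see which crawled pages fall into this second bucket, you could log a line whenever the listing selector matches nothing. A sketch of how parse_start_url might start if dropped into the spider from the question (item building and pagination are omitted for brevity; only the empty check and log call are additions):

def parse_start_url(self, response):
    hxs = HtmlXPathSelector(response)
    offers = hxs.select("//div[@class='hlist']")
    if not offers:
        # The page was crawled fine; it just contains no ad blocks,
        # so it can never contribute a scraped item.
        self.log("No 'hlist' divs found on %s" % response.url)
    for offer in offers:
        url = offer.select('.//span[@class="location"]/a/@href').extract()
        if url:
            yield Request(url[0], callback=self.second_page)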
Your answer lies in the DEBUG log output.
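If scrolling through thousands of DEBUG lines is impractical, you could write the log to a file (for example with scrapy crawl myspider -s LOG_FILE=crawl.log) and tally the interesting lines afterwards. A rough sketch; the exact wording of the log messages varies a little between Scrapy versions, so check the patterns against your own log:

# Quick-and-dirty tally of a Scrapy log file written to crawl.log.
from collections import Counter

counts = Counter()
with open("crawl.log") as log:
    for line in log:
        if "Crawled (200)" in line:
            counts["crawled 200"] += 1
        elif "Crawled (404)" in line:
            counts["crawled 404"] += 1
        if "Scraped from" in line:
            counts["scraped items"] += 1

print(counts)

Comparing the crawled counts per status against the scraped count should tell you which of the two scenarios above (or a mix of both) you are hitting.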