I'm trying to follow the pages on this website where the next page number generation is pretty strange. Instead of normal indexation, next pages look like this:
new/v2.php?cat=69&pnum=2&pnum=3
new/v2.php?cat=69&pnum=2&pnum=3&pnum=4
new/v2.php?cat=69&pnum=2&pnum=3&pnum=4&pnum=5
and as a result my scraper gets into loop and never stops, scraping items from this kind of pages:
DEBUG: Scraped from <200 http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=1&pnum=1&pnum=2&pnum=3>`
and so on. While the scraped items are correct and match the target(s), crawler never stops, going for pages all over again.
my crawler looks like this:
from scrapy.item import Item, Field
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
import re
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin
from mymobile.items import MymobileItem
class MmobySpider(CrawlSpider):
name = "mmoby2"
allowed_domains = ["mymobile.ge"]
start_urls = [
"http://mymobile.ge/new/v2.php?cat=69&pnum=1"
]
rules = (Rule(SgmlLinkExtractor(allow=("new/v2.php\?cat=69&pnum=\d*", ))
, callback="parse_items", follow=True),)
def parse_items(self, response):
sel = Selector(response)
titles = sel.xpath('//table[@width="1000"]//td/table[@class="probg"]')
items = []
for t in titles:
url = t.xpath('tr//a/@href').extract()
item = MymobileItem()
item["brand"] = t.xpath('tr[2]/td/text()').re('^([\w\-]+)')
item["model"] = t.xpath('tr[2]/td/text()').re('\s+(.*)$')
item["price"] = t.xpath('tr[3]/td//text()').re('^([0-9\.]+)')
item["url"] = urljoin("http://mymobile.ge/new/", url[0])
items.append(item)
return(items)
any suggestion how can I tame it?
As I understand it. All page numbers appear in your start url, http://mymobile.ge/new/v2.php?cat=69&pnum=1
, so you could use follow=False
and the rule only will be executed once but it will extract all the links in that first pass.
I tried with:
from scrapy.item import Item, Field
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
import re
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin
class MmobySpider(CrawlSpider):
name = "mmoby2"
allowed_domains = ["mymobile.ge"]
start_urls = [
"http://mymobile.ge/new/v2.php?cat=69&pnum=1"
]
rules = (
Rule(SgmlLinkExtractor(
allow=("new/v2\.php\?cat=69&pnum=\d*",),
)
, callback="parse_items", follow=False),)
def parse_items(self, response):
sel = Selector(response)
print response.url
Ran it like:
scrapy crawl mmoby2
And the number of request count was six, with following output:
...
2014-05-18 12:20:35+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1> (referer: None)
2014-05-18 12:20:36+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1> (referer: http://mymobile.ge/new/v2.php?cat=69&pnum=1)
http://mymobile.ge/new/v2.php?cat=69&pnum=1
2014-05-18 12:20:37+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=4> (referer: http://mymobile.ge/new/v2.php?cat=69&pnum=1)
http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=4
2014-05-18 12:20:38+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=2> (referer: http://mymobile.ge/new/v2.php?cat=69&pnum=1)
http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=2
2014-05-18 12:20:38+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=5> (referer: http://mymobile.ge/new/v2.php?cat=69&pnum=1)
http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=5
2014-05-18 12:20:39+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=3> (referer: http://mymobile.ge/new/v2.php?cat=69&pnum=1)
http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=3
2014-05-18 12:20:39+0200 [mmoby2] INFO: Closing spider (finished)
2014-05-18 12:20:39+0200 [mmoby2] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1962,
'downloader/request_count': 6,
'downloader/request_method_count/GET': 6,
...
If extracting links with Smgllinkextractor fails you can always use simple scrapy spider and extract links for next page with selectors/xpaths, then yield Request for next page with callback to parse and stop process when there is no next page link.
Something like this should work for you.
from scrapy.spider import Spider
from scrapy.http import Request
class MmobySpider(Spider):
name = "mmoby2"
allowed_domains = ["mymobile.ge"]
start_urls = [
"http://mymobile.ge/new/v2.php?cat=69&pnum=1"
]
def parse(self, response):
sel = Selector(response)
titles = sel.xpath('//table[@width="1000"]//td/table[@class="probg"]')
items = []
for t in titles:
url = t.xpath('tr//a/@href').extract()
item = MymobileItem()
item["brand"] = t.xpath('tr[2]/td/text()').re('^([\w\-]+)')
item["model"] = t.xpath('tr[2]/td/text()').re('\s+(.*)$')
item["price"] = t.xpath('tr[3]/td//text()').re('^([0-9\.]+)')
item["url"] = urljoin("http://mymobile.ge/new/", url[0])
yield item
# extract next page link
next_page_xpath = "//td[span]/following-sibling::td[1]/a[contains(@href, 'num')]/@href"
next_page = sel.xpath(next_page_xpath).extract()
# if there is next page yield Request for it
if next_page:
next_page = urljoin(response.url, next_page[0])
yield Request(next_page, callback=self.parse)
Xpath for next page is not an easy one due to completely unsemantic markup of your page, but it should work ok.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With