 

Scrapy crawler doesn't stop when following next pages

Tags:

python

scrapy

I'm trying to follow the pages on this website, where the next page number generation is pretty strange. Instead of normal indexation, the next pages look like this:

new/v2.php?cat=69&pnum=2&pnum=3
new/v2.php?cat=69&pnum=2&pnum=3&pnum=4
new/v2.php?cat=69&pnum=2&pnum=3&pnum=4&pnum=5

and as a result my scraper gets into a loop and never stops, scraping items from pages like this:

DEBUG: Scraped from <200 http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=1&pnum=1&pnum=2&pnum=3>

and so on. While the scraped items are correct and match the target(s), the crawler never stops, requesting the same pages over and over.
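Presumably the loop happens because Scrapy deduplicates requests by URL fingerprint, and every extra &pnum=... produces a URL the dupefilter has never seen before. A quick sketch that seems to confirm this:

from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

# the two URLs differ only by a repeated pnum parameter, yet they
# produce different fingerprints, so neither one is filtered out
a = Request("http://mymobile.ge/new/v2.php?cat=69&pnum=2&pnum=3")
b = Request("http://mymobile.ge/new/v2.php?cat=69&pnum=2&pnum=3&pnum=4")
print request_fingerprint(a) == request_fingerprint(b)  # False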

My crawler looks like this:

from scrapy.item import Item, Field
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
import re
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin


from mymobile.items import MymobileItem


class MmobySpider(CrawlSpider):
    name = "mmoby2" 
    allowed_domains = ["mymobile.ge"]
    start_urls = [
        "http://mymobile.ge/new/v2.php?cat=69&pnum=1"
    ]

    rules = (Rule(SgmlLinkExtractor(allow=("new/v2.php\?cat=69&pnum=\d*",)),
                  callback="parse_items", follow=True),)

    def parse_items(self, response):
        sel = Selector(response)
        titles = sel.xpath('//table[@width="1000"]//td/table[@class="probg"]')
        items = []
        for t in titles:
            url = t.xpath('tr//a/@href').extract()
            item = MymobileItem()
            item["brand"] = t.xpath('tr[2]/td/text()').re('^([\w\-]+)')
            item["model"] = t.xpath('tr[2]/td/text()').re('\s+(.*)$')
            item["price"] = t.xpath('tr[3]/td//text()').re('^([0-9\.]+)')
            item["url"] = urljoin("http://mymobile.ge/new/", url[0])

            items.append(item)

        return items

Any suggestions on how I can tame it?

asked by ikechi

2 Answers

As I understand it, all the page links appear on your start URL, http://mymobile.ge/new/v2.php?cat=69&pnum=1, so you can use follow=False: the rule will only be executed once, but it will extract all the links in that first pass.

I tried with:

from scrapy.item import Item, Field
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
import re
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin


class MmobySpider(CrawlSpider):
    name = "mmoby2" 
    allowed_domains = ["mymobile.ge"]
    start_urls = [ 
        "http://mymobile.ge/new/v2.php?cat=69&pnum=1"
    ]   

    rules = (
        Rule(SgmlLinkExtractor(
                allow=("new/v2\.php\?cat=69&pnum=\d*",),
            ),
            callback="parse_items", follow=False),
    )

    def parse_items(self, response):
        sel = Selector(response)
        print response.url

I ran it with:

scrapy crawl mmoby2

The request count was six, with the following output:

...
2014-05-18 12:20:35+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1> (referer: None)
2014-05-18 12:20:36+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1> (referer: http://mymobile.ge/new/v2.php?cat=69&pnum=1)
http://mymobile.ge/new/v2.php?cat=69&pnum=1
2014-05-18 12:20:37+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=4> (referer: http://mymobile.ge/new/v2.php?cat=69&pnum=1)
http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=4
2014-05-18 12:20:38+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=2> (referer: http://mymobile.ge/new/v2.php?cat=69&pnum=1)
http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=2
2014-05-18 12:20:38+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=5> (referer: http://mymobile.ge/new/v2.php?cat=69&pnum=1)
http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=5
2014-05-18 12:20:39+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=3> (referer: http://mymobile.ge/new/v2.php?cat=69&pnum=1)
http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=3
2014-05-18 12:20:39+0200 [mmoby2] INFO: Closing spider (finished)
2014-05-18 12:20:39+0200 [mmoby2] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 1962,
         'downloader/request_count': 6,
         'downloader/request_method_count/GET': 6,
         ...
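Alternatively (just a sketch, not something I have tested against the site): SgmlLinkExtractor also accepts a process_value callable, so you could keep follow=True and normalize every extracted link instead, collapsing the repeated pnum values so the dupefilter discards pages it has already seen. Note that PHP reads the last value of a repeated GET parameter, so the last pnum is the one to keep:

from urlparse import urlparse, parse_qsl
from urllib import urlencode
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

def dedupe_pnum(value):
    # collapse the query string to one value per key; dict() keeps the
    # last occurrence, which is the value PHP actually uses
    parts = urlparse(value)
    query = dict(parse_qsl(parts.query))
    return parts._replace(query=urlencode(sorted(query.items()))).geturl()

rules = (
    Rule(SgmlLinkExtractor(
            allow=("new/v2\.php\?cat=69&pnum=\d+",),
            process_value=dedupe_pnum,
        ),
        callback="parse_items", follow=True),
)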
answered by Birei

If extracting links with SgmlLinkExtractor fails, you can always use a simple Scrapy spider and extract the next page link with selectors/XPaths, then yield a Request for the next page with a callback to parse, and stop the process when there is no next page link.

Something like this should work for you.

from scrapy.spider import Spider
from scrapy.http import Request
from scrapy.selector import Selector
from urlparse import urljoin

from mymobile.items import MymobileItem

class MmobySpider(Spider):
    name = "mmoby2"
    allowed_domains = ["mymobile.ge"]
    start_urls = [
        "http://mymobile.ge/new/v2.php?cat=69&pnum=1"
    ]

    def parse(self, response):
        sel = Selector(response)
        titles = sel.xpath('//table[@width="1000"]//td/table[@class="probg"]')
        items = []
        for t in titles:
            url = t.xpath('tr//a/@href').extract()
            item = MymobileItem()
            item["brand"] = t.xpath('tr[2]/td/text()').re('^([\w\-]+)')
            item["model"] = t.xpath('tr[2]/td/text()').re('\s+(.*)$')
            item["price"] = t.xpath('tr[3]/td//text()').re('^([0-9\.]+)')
            item["url"] = urljoin("http://mymobile.ge/new/", url[0])

            yield item

        # extract next page link
        next_page_xpath = "//td[span]/following-sibling::td[1]/a[contains(@href, 'num')]/@href"
        next_page = sel.xpath(next_page_xpath).extract()

        # if there is next page yield Request for it
        if next_page:
            next_page = urljoin(response.url, next_page[0])
            yield Request(next_page, callback=self.parse)

The XPath for the next page is not an easy one, due to the completely unsemantic markup of the page, but it should work OK.
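To make that XPath easier to follow, here is a hypothetical sketch of the pager markup it assumes (the real page may differ in detail): the current page is a bare <span>, and the next page is the link in the immediately following <td>:

from scrapy.selector import Selector

pager = """
<table><tr>
  <td><a href="v2.php?cat=69&pnum=1">1</a></td>
  <td><span>2</span></td>
  <td><a href="v2.php?cat=69&pnum=3">3</a></td>
</tr></table>
"""
sel = Selector(text=pager)
# selects the href of the link right after the current-page <span>
print sel.xpath("//td[span]/following-sibling::td[1]"
                "/a[contains(@href, 'num')]/@href").extract()
# [u'v2.php?cat=69&pnum=3']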

answered by Pawel Miech