I am crawling product URLs using CrawlSpider with these rules:
    rules = (
        Rule(LinkExtractor(restrict_css=('.resultspagenum',))),
        Rule(LinkExtractor(allow=(r'/mobiles/smartphones/[a-zA-Z0-9_.-]*',)),
             callback='parse_product'),
    )
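For context, here is a minimal sketch of how these rules sit inside a complete CrawlSpider; the spider name, allowed_domains, and start_urls are placeholder assumptions, not taken from the original code:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class SmartphonesSpider(CrawlSpider):
        # name, allowed_domains and start_urls are placeholders (assumptions)
        name = 'smartphones'
        allowed_domains = ['example.com']
        start_urls = ['http://example.com/en-sa/mobiles/smartphones/']

        rules = (
            # Pagination links inside the .resultspagenum element;
            # no callback, so the extracted pages are just followed
            Rule(LinkExtractor(restrict_css=('.resultspagenum',))),
            # Product pages are handed to parse_product
            Rule(LinkExtractor(allow=(r'/mobiles/smartphones/[a-zA-Z0-9_.-]*',)),
                 callback='parse_product'),
        )

        def parse_product(self, response):
            # The actual extraction logic is not shown in the question
            self.logger.info('Product page: %s', response.url)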
I do not understand this behavior; can somebody please explain? The same code was working last week. I am using Scrapy version 1.3.0.
Following the suggestion of @paul trmbrth, I rechecked the code and the website being scraped. Scrapy was downloading the links and then filtering them out because they had been downloaded before. The issue was that the href attribute of the 'a' tags in the HTML had changed from a static link to a JavaScript function call:
    <a href='javascript:gtm.traceProductClick("/en-sa/mobiles/smartphones/samsung-galaxy-s7-32gb-dual-sim-lte-gold-188024")'>
Accordingly, I changed my spider code as follows:
    import re

    def _process_value(value):
        # Pull the real product URL out of the javascript: wrapper
        m = re.search(r'javascript:gtm\.traceProductClick\("(.*?)"', value)
        if m:
            return m.group(1)

    rules = (
        Rule(LinkExtractor(restrict_css=('.resultspagenum',))),
        Rule(LinkExtractor(
            allow=(r'/mobiles/smartphones/[a-zA-Z0-9_.-]*',),
            process_value=_process_value
        ), callback='parse_product'),
    )
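LinkExtractor applies process_value to the raw attribute value before the allow pattern is checked, so the regex ends up matching the cleaned URL. As a quick sanity check (assuming _process_value is defined as above), the helper strips the JavaScript wrapper and returns the bare product path:

    >>> _process_value('javascript:gtm.traceProductClick("/en-sa/mobiles/smartphones/samsung-galaxy-s7-32gb-dual-sim-lte-gold-188024")')
    '/en-sa/mobiles/smartphones/samsung-galaxy-s7-32gb-dual-sim-lte-gold-188024'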
So this was not an issue of Scrapy filtering out non-unique URLs; it was about extracting the link from the 'href' attribute of the 'a' tag, because that link had recently changed and broken my code. Thanks again, @paul trmbrth.