 

Scrapy is filtering unique urls as duplicate urls

The URLs:

  1. http://www.extrastores.com/en-sa/products/mobiles/smartphones-99500240157?page=1
  2. http://www.extrastores.com/en-sa/products/mobiles/smartphones-99500240157?page=2

are unique, but Scrapy is filtering them as duplicates and not scraping them.

I am using CrawlSpider with these rules:

rules = (
    Rule(LinkExtractor(restrict_css=('.resultspagenum'))),
    Rule(LinkExtractor(allow=('\/mobiles\/smartphones\/[a-zA-Z0-9_.-]*',), ), callback='parse_product'),
)

I do not understand this behavior. Can somebody please explain? The same code was working last week. I am using Scrapy version 1.3.0.
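For context, here is roughly how these rules are wired into the spider (the class name, allowed_domains, start_urls, and the parse_product body below are illustrative placeholders, not my exact code):

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class SmartphonesSpider(CrawlSpider):
        # Placeholder names and start URL, for illustration only
        name = 'smartphones'
        allowed_domains = ['www.extrastores.com']
        start_urls = ['http://www.extrastores.com/en-sa/products/mobiles/smartphones-99500240157']

        rules = (
            # Follow pagination links inside the .resultspagenum element
            Rule(LinkExtractor(restrict_css=('.resultspagenum',))),
            # Follow product links and hand them to parse_product
            Rule(LinkExtractor(allow=(r'/mobiles/smartphones/[a-zA-Z0-9_.-]*',)),
                 callback='parse_product'),
        )

        def parse_product(self, response):
            # Placeholder parse: yield the product URL only
            yield {'url': response.url}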

asked Oct 18 '22 by javed

1 Answer

Following the suggestion of @paul trmbrth, I rechecked the code and the website being scraped. Scrapy was downloading the links and then filtering them because they had already been downloaded. The real issue was that the href attribute of the <a> tag in the HTML had changed from a static link to a JavaScript function call:

<a href='javascript:gtm.traceProductClick("/en-sa/mobiles/smartphones/samsung-galaxy-s7-32gb-dual-sim-lte-gold-188024")'>

Accordingly, I changed my spider code as follows:

import re

def _process_value(value):
    # Pull the real URL out of the javascript:gtm.traceProductClick("...") wrapper
    m = re.search(r'javascript:gtm\.traceProductClick\("(.*?)"', value)
    if m:
        return m.group(1)


rules = (
    Rule(LinkExtractor(restrict_css=('.resultspagenum',))),
    Rule(LinkExtractor(
        allow=(r'/mobiles/smartphones/[a-zA-Z0-9_.-]*',),
        process_value=_process_value,
    ), callback='parse_product'),
)
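As a quick sanity check (illustrative interpreter session, using the href value from the tag shown above), _process_value returns the clean product path, which the allow pattern can then match:

    >>> _process_value('javascript:gtm.traceProductClick("/en-sa/mobiles/smartphones/samsung-galaxy-s7-32gb-dual-sim-lte-gold-188024")')
    '/en-sa/mobiles/smartphones/samsung-galaxy-s7-32gb-dual-sim-lte-gold-188024'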

So this was not an issue of Scrapy filtering non-unique URLs; it was about extracting the link from the href attribute of the <a> tag, because that attribute had changed recently and broke my code. Thanks again @paul trmbrth.
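As an aside, for anyone debugging a similar "Filtered duplicate request" message: as far as I know, Scrapy's DUPEFILTER_DEBUG setting makes the dupefilter log every filtered request instead of only the first one, which helps confirm exactly which URLs are being dropped:

    # settings.py (sketch): log every duplicate request the dupefilter drops,
    # instead of only the first one, to see which URLs get filtered.
    DUPEFILTER_DEBUG = True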

answered Oct 20 '22 by javed