I am crawling product URLs using CrawlSpider with these rules:
    rules = (
        Rule(LinkExtractor(restrict_css=('.resultspagenum',))),
        Rule(LinkExtractor(allow=(r'/mobiles/smartphones/[a-zA-Z0-9_.-]*',)),
             callback='parse_product'),
    )
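For context, here is a minimal sketch of how these rules sit inside a complete CrawlSpider; the spider name, allowed_domains, and start_urls are placeholder assumptions, not taken from the original code:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class SmartphonesSpider(CrawlSpider):
        # name, allowed_domains and start_urls are placeholders (assumptions)
        name = 'smartphones'
        allowed_domains = ['example.com']
        start_urls = ['http://example.com/en-sa/mobiles/smartphones/']

        rules = (
            # Pagination links inside the .resultspagenum element;
            # no callback, so the extracted pages are just followed
            Rule(LinkExtractor(restrict_css=('.resultspagenum',))),
            # Product pages are handed to parse_product
            Rule(LinkExtractor(allow=(r'/mobiles/smartphones/[a-zA-Z0-9_.-]*',)),
                 callback='parse_product'),
        )

        def parse_product(self, response):
            # The actual extraction logic is not shown in the question
            self.logger.info('Product page: %s', response.url)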
I do not understand this behavior; can somebody please explain? The same code was working last week. I am using Scrapy version 1.3.0.
Following the suggestion of @paul trmbrth, I rechecked the code and the website being scraped. Scrapy was downloading the links and then filtering them out because they had been downloaded before. The issue was that the href attribute of the 'a' tags in the HTML had changed from a static link to a JavaScript function call:
    <a href='javascript:gtm.traceProductClick("/en-sa/mobiles/smartphones/samsung-galaxy-s7-32gb-dual-sim-lte-gold-188024")'>
Accordingly, I changed my spider code as follows:
    import re

    def _process_value(value):
        # Pull the real product URL out of the javascript: wrapper
        m = re.search(r'javascript:gtm\.traceProductClick\("(.*?)"', value)
        if m:
            return m.group(1)

    rules = (
        Rule(LinkExtractor(restrict_css=('.resultspagenum',))),
        Rule(LinkExtractor(
            allow=(r'/mobiles/smartphones/[a-zA-Z0-9_.-]*',),
            process_value=_process_value
        ), callback='parse_product'),
    )
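LinkExtractor applies process_value to the raw attribute value before the allow pattern is checked, so the regex ends up matching the cleaned URL. As a quick sanity check (assuming _process_value is defined as above), the helper strips the JavaScript wrapper and returns the bare product path:

    >>> _process_value('javascript:gtm.traceProductClick("/en-sa/mobiles/smartphones/samsung-galaxy-s7-32gb-dual-sim-lte-gold-188024")')
    '/en-sa/mobiles/smartphones/samsung-galaxy-s7-32gb-dual-sim-lte-gold-188024'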
So this was not an issue of Scrapy filtering out non-unique URLs; it was about extracting the link from the 'href' attribute of the 'a' tag, because that link had recently changed and broken my code. Thanks again, @paul trmbrth.