I'm crawling a site and using Scrapy's LinkExtractor to follow links and determine their response status.
I also want to use the LinkExtractor to get image src URLs from the site. My code works well with the site's page URLs, but I can't seem to get the images: nothing is logged to the console for them.
handle_httpstatus_list = [404,502]
# allowed_domains = ['mydomain']
start_urls = ['somedomain.com/']
http_user = '###'
http_pass = '#####'
rules = (
    Rule(LinkExtractor(allow=('domain.com',),canonicalize = True, unique = True), process_links='filter_links', follow = False, callback='parse_local_link'),
    Rule(LinkExtractor(allow=('cdn.domain.com'),tags = ('img',),attrs=('src',),canonicalize = True, unique = True), follow = False, callback='parse_image_link'),
)
def filter_links(self,links):
    for link in
def parse_local_link(self, response):
    if response.status != 200:
        item = LinkcheckerItem()
        item['url'] = response.url
        item['status'] = response.status
        item['link_type'] = 'local'
        item['referer'] = response.request.headers.get('Referer',None)
        yield item
def parse_image_link(self, response):
    print "Got image link"
    if response.status != 200:
        item = LinkcheckerItem()
        item['url'] = response.url
        item['status'] = response.status
        item['link_type'] = 'img'
        item['referer'] = response.request.headers.get('Referer',None)
        yield item
If you want to keep using CrawlSpider with LinkExtractors, just add the deny_extensions kwarg, i.e., replace:
Rule(LinkExtractor(allow=('cdn.domain.com'),tags = ('img',),attrs=('src',),canonicalize = True, unique = True), follow = False, callback='parse_image_link'),
with
Rule(LinkExtractor(allow=('cdn.domain.com'),deny_extensions=set(), tags = ('img',),attrs=('src',),canonicalize = True, unique = True), follow = False, callback='parse_image_link')
When this parameter is not set, it defaults to scrapy.linkextractors.IGNORED_EXTENSIONS, which contains jpeg, png, and other extensions. This means the link extractor skips any link whose URL ends in one of those extensions, so the image links never reach the callback. Passing an empty set disables that filter.
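To make the effect of deny_extensions easier to see, here is a rough stand-alone sketch of the idea. It is not Scrapy's actual implementation: DEFAULT_IGNORED, has_denied_extension and keep_allowed_links are hypothetical names made up for this illustration, and the default set below is only a small subset chosen for the demo.

```python
from posixpath import splitext
from urllib.parse import urlparse

# Hypothetical stand-in for the default ignore list (illustration only,
# not the real contents of scrapy.linkextractors.IGNORED_EXTENSIONS).
DEFAULT_IGNORED = {"jpg", "jpeg", "png", "gif"}


def has_denied_extension(url, deny_extensions):
    """Return True if the URL path ends in one of the denied extensions."""
    ext = splitext(urlparse(url).path)[1].lower()  # e.g. ".png", or "" if none
    return ext.lstrip(".") in deny_extensions


def keep_allowed_links(urls, deny_extensions=None):
    """Drop URLs with a denied extension; None means 'use the default list'."""
    if deny_extensions is None:
        deny_extensions = DEFAULT_IGNORED
    return [u for u in urls if not has_denied_extension(u, deny_extensions)]


urls = ["http://cdn.domain.com/logo.png", "http://domain.com/about"]
print(keep_allowed_links(urls))                         # default list: the .png link is dropped
print(keep_allowed_links(urls, deny_extensions=set()))  # empty set: both links are kept
```

Note that the sketch checks `is None` rather than truthiness, so an explicitly passed empty set is respected and turns the filter off, which is the behaviour the fix above relies on.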