
Include Image src to LinkExtractor Scrapy CrawlSpider

Tags:

python

scrapy

I'm crawling a site and using Scrapy's LinkExtractor to collect links and check their response status.

I also want to use the LinkExtractor to collect image src URLs from the site. My code works well for the site's page URLs, but I can't seem to get the images: nothing for them is logged to the console.

handle_httpstatus_list = [404, 502]
# allowed_domains = ['mydomain']

start_urls = ['somedomain.com/']

http_user = '###'
http_pass = '#####'

rules = (
    Rule(LinkExtractor(allow=('domain.com',), canonicalize=True, unique=True), process_links='filter_links', follow=False, callback='parse_local_link'),
    Rule(LinkExtractor(allow=('cdn.domain.com',), tags=('img',), attrs=('src',), canonicalize=True, unique=True), follow=False, callback='parse_image_link'),
)

def filter_links(self, links):
    # (body was truncated in the original post; pass links through unchanged)
    for link in links:
        yield link

def parse_local_link(self, response):
    if response.status != 200:
        item = LinkcheckerItem()
        item['url'] = response.url
        item['status'] = response.status
        item['link_type'] = 'local'
        item['referer'] = response.request.headers.get('Referer',None)
        yield item

def parse_image_link(self, response):
    print("Got image link")
    if response.status != 200:
        item = LinkcheckerItem()
        item['url'] = response.url
        item['status'] = response.status
        item['link_type'] = 'img'
        item['referer'] = response.request.headers.get('Referer',None)
        yield item
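The LinkcheckerItem used above is not shown in the post; in a real project it would be a scrapy.Item with scrapy.Field() entries for url, status, link_type, and referer. As a rough, hypothetical stdlib stand-in that mimics the dict-style access used in the callbacks:

```python
# Hypothetical stand-in for the LinkcheckerItem referenced above;
# a real Scrapy project would declare it as a scrapy.Item with Field() entries.
class LinkcheckerItem(dict):
    fields = {'url', 'status', 'link_type', 'referer'}

    def __setitem__(self, key, value):
        # Like scrapy.Item, reject keys that are not declared fields.
        if key not in self.fields:
            raise KeyError(f'{key} is not a declared field')
        super().__setitem__(key, value)

item = LinkcheckerItem()
item['url'] = 'https://domain.com/missing'
item['status'] = 404
item['link_type'] = 'local'
```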
Vincent Pakson asked Dec 12 '25 06:12

1 Answer

In case anyone is interested in keeping the CrawlSpider with LinkExtractors, just add the kwarg deny_extensions, i.e., replace:

    Rule(LinkExtractor(allow=('cdn.domain.com',), tags=('img',), attrs=('src',), canonicalize=True, unique=True), follow=False, callback='parse_image_link'),

with

    Rule(LinkExtractor(allow=('cdn.domain.com',), deny_extensions=set(), tags=('img',), attrs=('src',), canonicalize=True, unique=True), follow=False, callback='parse_image_link'),

When this parameter is not set, it defaults to scrapy.linkextractors.IGNORED_EXTENSIONS, which contains jpeg, png, and other common file extensions. This means the link extractor silently drops links with those extensions before your callback ever runs, which is why the image rule never logged anything.
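The filtering behaviour can be sketched without Scrapy. The snippet below is a minimal stand-in for the extension filter, not Scrapy's actual code; IGNORED_EXTENSIONS_SAMPLE is a hand-picked subset of the real IGNORED_EXTENSIONS list:

```python
from urllib.parse import urlparse

# Hand-picked subset of scrapy.linkextractors.IGNORED_EXTENSIONS,
# for illustration only.
IGNORED_EXTENSIONS_SAMPLE = {'jpg', 'jpeg', 'png', 'gif', 'pdf', 'zip'}

def is_denied(url, deny_extensions=IGNORED_EXTENSIONS_SAMPLE):
    """Return True if the URL's file extension is in the deny set."""
    path = urlparse(url).path
    ext = path.rsplit('.', 1)[-1].lower() if '.' in path else ''
    return ext in deny_extensions

# With the default deny set, image links are dropped before any callback runs:
print(is_denied('https://cdn.domain.com/banner.png'))           # True
# Passing deny_extensions=set() lets the same link through:
print(is_denied('https://cdn.domain.com/banner.png', set()))    # False
```

This is why an empty set as deny_extensions makes the image rule fire: nothing is filtered out, so the .png links reach parse_image_link.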

pwoolvett answered Dec 14 '25 18:12

