
Include Image src to LinkExtractor Scrapy CrawlSpider

Tags:

python

scrapy

I'm crawling a site and using Scrapy's LinkExtractor to collect links and check their response status.

I also want to use the LinkExtractor to collect image src URLs from the site. My code works well for the site's page URLs, but I can't seem to get the images: nothing for them is logged to the console.

handle_httpstatus_list = [404, 502]
# allowed_domains = ['mydomain']

start_urls = ['somedomain.com/']

http_user = '###'
http_pass = '#####'

rules = (
    Rule(LinkExtractor(allow=('domain.com',), canonicalize=True, unique=True), process_links='filter_links', follow=False, callback='parse_local_link'),
    Rule(LinkExtractor(allow=('cdn.domain.com',), tags=('img',), attrs=('src',), canonicalize=True, unique=True), follow=False, callback='parse_image_link'),
)

def filter_links(self, links):
    # (body was truncated in the original post; pass links through unchanged)
    for link in links:
        yield link

def parse_local_link(self, response):
    if response.status != 200:
        item = LinkcheckerItem()
        item['url'] = response.url
        item['status'] = response.status
        item['link_type'] = 'local'
        item['referer'] = response.request.headers.get('Referer',None)
        yield item

def parse_image_link(self, response):
    print("Got image link")
    if response.status != 200:
        item = LinkcheckerItem()
        item['url'] = response.url
        item['status'] = response.status
        item['link_type'] = 'img'
        item['referer'] = response.request.headers.get('Referer',None)
        yield item
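The LinkcheckerItem used above is not shown in the post; in a real project it would be a scrapy.Item with scrapy.Field() entries for url, status, link_type, and referer. As a rough, hypothetical stdlib stand-in that mimics the dict-style access used in the callbacks:

```python
# Hypothetical stand-in for the LinkcheckerItem referenced above;
# a real Scrapy project would declare it as a scrapy.Item with Field() entries.
class LinkcheckerItem(dict):
    fields = {'url', 'status', 'link_type', 'referer'}

    def __setitem__(self, key, value):
        # Like scrapy.Item, reject keys that are not declared fields.
        if key not in self.fields:
            raise KeyError(f'{key} is not a declared field')
        super().__setitem__(key, value)

item = LinkcheckerItem()
item['url'] = 'https://domain.com/missing'
item['status'] = 404
item['link_type'] = 'local'
```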
Vincent Pakson asked Dec 12 '25 06:12

1 Answer

In case anyone is interested in keeping the CrawlSpider with LinkExtractors, just add the kwarg deny_extensions, i.e., replace:

    Rule(LinkExtractor(allow=('cdn.domain.com',), tags=('img',), attrs=('src',), canonicalize=True, unique=True), follow=False, callback='parse_image_link'),

with

    Rule(LinkExtractor(allow=('cdn.domain.com',), deny_extensions=set(), tags=('img',), attrs=('src',), canonicalize=True, unique=True), follow=False, callback='parse_image_link'),

When this parameter is not set, it defaults to scrapy.linkextractors.IGNORED_EXTENSIONS, which contains jpeg, png, and other common file extensions. This means the link extractor silently drops links with those extensions before your callback ever runs, which is why the image rule never logged anything.
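The filtering behaviour can be sketched without Scrapy. The snippet below is a minimal stand-in for the extension filter, not Scrapy's actual code; IGNORED_EXTENSIONS_SAMPLE is a hand-picked subset of the real IGNORED_EXTENSIONS list:

```python
from urllib.parse import urlparse

# Hand-picked subset of scrapy.linkextractors.IGNORED_EXTENSIONS,
# for illustration only.
IGNORED_EXTENSIONS_SAMPLE = {'jpg', 'jpeg', 'png', 'gif', 'pdf', 'zip'}

def is_denied(url, deny_extensions=IGNORED_EXTENSIONS_SAMPLE):
    """Return True if the URL's file extension is in the deny set."""
    path = urlparse(url).path
    ext = path.rsplit('.', 1)[-1].lower() if '.' in path else ''
    return ext in deny_extensions

# With the default deny set, image links are dropped before any callback runs:
print(is_denied('https://cdn.domain.com/banner.png'))           # True
# Passing deny_extensions=set() lets the same link through:
print(is_denied('https://cdn.domain.com/banner.png', set()))    # False
```

This is why an empty set as deny_extensions makes the image rule fire: nothing is filtered out, so the .png links reach parse_image_link.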

pwoolvett answered Dec 14 '25 18:12

