Downloading pictures with scrapy

Tags: python, scrapy
I'm starting out with Scrapy and have hit my first real problem: downloading pictures. This is my spider:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from example.items import ProductItem
from scrapy.utils.response import get_base_url

import re

class ProductSpider(CrawlSpider):
    name = "product"
    allowed_domains = ["domain.com"]
    start_urls = [
            "http://www.domain.com/category/supplies/accessories.do"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        sites = hxs.select('//td[@class="thumbtext"]')
        number = 0
        for site in sites:
            item = ProductItem()
            xpath = '//div[@class="thumb"]/img/@src'
            item['image_urls'] = site.select(xpath).extract()[number]
            item['image_urls'] = 'http://www.domain.com' + item['image_urls']
            items.append(item)
            number = number + 1
        return items

When I comment out ITEM_PIPELINES and IMAGES_STORE in settings.py, I get the proper URL for the picture I want to download (I pasted it into a browser to check).

But when I uncomment them, I get the following error:

raise ValueError('Missing scheme in request url: %s' % self._url)
exceptions.ValueError: Missing scheme in request url: h

and I can't download my pictures.

I've searched for the whole day and didn't find anything helpful.

asked Jan 07 '12 22:01

iblazevic

2 Answers

I think the image URL you scraped is relative. To construct the absolute URL use urlparse.urljoin:

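import urlparse  # Python 2; on Python 3 use: from urllib import parse as urlparse

def parse(self, response):
    ...
    image_relative_url = hxs.select("...").extract()[0]
    # Resolve the scraped src against the page URL to get an absolute URL.
    image_absolute_url = urlparse.urljoin(response.url, image_relative_url.strip())
    # image_urls must be a list, even when there is only one image.
    item['image_urls'] = [image_absolute_url]
    ...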
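
To see what urljoin does with different kinds of src values, here is a small standalone sketch. The page URL is the one from the question; the image paths and the cdn host are made up for illustration.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2, as used in the answer

# Page URL taken from the question's start_urls.
page_url = "http://www.domain.com/category/supplies/accessories.do"

# Hypothetical src values, for illustration only.
root_relative = urljoin(page_url, "/images/thumbs/1.jpg")
path_relative = urljoin(page_url, "images/thumbs/1.jpg")
already_absolute = urljoin(page_url, "http://cdn.domain.com/1.jpg")

print(root_relative)     # http://www.domain.com/images/thumbs/1.jpg
print(path_relative)     # http://www.domain.com/category/supplies/images/thumbs/1.jpg
print(already_absolute)  # http://cdn.domain.com/1.jpg
```

Unlike prepending 'http://www.domain.com' by hand, urljoin leaves a src that is already absolute untouched and resolves path-relative values against the page's directory.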

Haven't used ITEM_PIPELINES, but the docs say:

In a Spider, you scrape an item and put the URLs of its images into a image_urls field.

So, item['image_urls'] should be a list of image URLs. But your code has:

item['image_urls'] = 'http://www.domain.com' + item['image_urls']

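So I guess it iterates over your single URL string character by character, treating each character as a URL. That would explain the lone "h" in the error message: it is the first character of "http://www.domain.com".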
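
A rough way to reproduce that outside Scrapy, as a sketch only: urls_without_scheme is a hypothetical helper that stands in for the scheme check, not Scrapy's actual code, and the image URL is invented.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

def urls_without_scheme(image_urls):
    """Return the entries that have no scheme such as 'http'.

    Hypothetical stand-in for the check behind the error message;
    it is not Scrapy's implementation.
    """
    return [url for url in image_urls if not urlparse(url).scheme]

url = "http://www.domain.com/images/1.jpg"  # made-up image URL

bad = urls_without_scheme(url)     # a str is iterated character by character
good = urls_without_scheme([url])  # a list is iterated URL by URL

print(bad[0])  # h
print(good)    # []
```

With the bare string, the first "URL" seen is the single character h, which has no scheme. Wrapped in a list, the whole URL is checked once and passes.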

answered Oct 29 '22 01:10

warvariuc

I think you may need to pass your image URL to the item as a list:

item['image_urls'] = [ 'http://www.domain.com' + item['image_urls'] ]

answered Oct 29 '22 02:10

ddn
