Scrapy: Images Pipeline, download images

Following Scrapy's tutorial, I made a simple image crawler that scrapes images of Bugattis, illustrated below under EXAMPLE.

However, following the guide has left me with a non-functioning crawler: it finds all of the URLs, but it does not download the images.

I found a duct-tape workaround: replace the ITEM_PIPELINES entry and IMAGES_STORE such that

ITEM_PIPELINES['scrapy.pipelines.files.FilesPipeline'] = 1 and

IMAGES_STORE -> FILES_STORE
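
For reference, a sketch of what that workaround amounts to in settings.py (assuming the same output directory as in the EXAMPLE below):

# duct-tape variant: use the generic FilesPipeline instead of the ImagesPipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
# FilesPipeline reads its target directory from FILES_STORE rather than IMAGES_STORE
FILES_STORE = "/home/user/Desktop/imagespider/output"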

But I do not know why this works. I would like to use the ImagesPipeline as documented by Scrapy.

EXAMPLE

settings.py

BOT_NAME = 'imagespider'
SPIDER_MODULES = ['imagespider.spiders']
NEWSPIDER_MODULE = 'imagespider.spiders'
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = "/home/user/Desktop/imagespider/output"

items.py

import scrapy

class ImageItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()

imagespider.py

from imagespider.items import ImageItem
import scrapy


class ImageSpider(scrapy.Spider):
    name = "imagespider"

    start_urls = (
        "https://www.find.com/search=bugatti+veyron",
    )

    def parse(self, response):
        for elem in response.xpath("//img"):
            img_url = elem.xpath("@src").extract_first()
            yield ImageItem(file_urls=[img_url])
asked Jul 26 '16 by Alexander R Johansen

1 Answer

The item your spider returns must contain a "file_urls" field for files and/or an "image_urls" field for images. In your code you configure the Images pipeline, but you return the URLs in "file_urls".

Simply change this line:

yield ImageItem(file_urls=[img_url])
# to
yield {'image_urls': [img_url]}

* Scrapy spiders can yield plain dictionaries instead of Item objects, which saves time when you only have one or two fields.
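
Alternatively, if you prefer to keep a typed item, the ImagesPipeline just needs the fields to be named image_urls and images. A sketch, adapting the item class from the question:

import scrapy

class ImageItem(scrapy.Item):
    # the ImagesPipeline reads URLs from image_urls and writes download results to images
    image_urls = scrapy.Field()
    images = scrapy.Field()

The spider then yields ImageItem(image_urls=[img_url]), and the IMAGES_STORE setting from the question can stay as it is.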

answered Sep 28 '22 by Granitosaurus