Trouble downloading images using scrapy

I've written a Scrapy spider in Python to download some images from a website. When I run the script, the image links (all in .jpg format) are printed in the console. However, when I open the folder where the images are supposed to be saved once the download finishes, it is empty. Where am I going wrong?

This is my spider (I'm running from sublime text editor):

import scrapy
from scrapy.crawler import CrawlerProcess

class YifyTorrentSpider(scrapy.Spider):
    name = "yifytorrent"

    start_urls= ['https://www.yify-torrent.org/search/1080p/']

    def parse(self, response):
        for q in response.css("article.img-item .poster-thumb"):
            image = response.urljoin(q.css("::attr(src)").extract_first())
            yield {'':image}

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',   
})
c.crawl(YifyTorrentSpider)
c.start()

This is what I've defined in settings.py for the images to be saved:

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = "/Desktop/torrentspider/torrentspider/spiders/Images"

To make things clearer:

  1. The folder where I expect the images to be saved is named Images, and I've placed it in the spiders folder of the torrentspider project.
  2. The actual path to the Images folder is C:\Users\WCS\Desktop\torrentspider\torrentspider\spiders.

I'm not asking how to make the script work with the help of an items.py file, so a solution that relies on items.py to make the download happen is not what I'm looking for.

SIM asked Jul 02 '18 15:07


1 Answer

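The item you are yielding does not match what Scrapy's documentation requires. As described in the media pipeline documentation, the ImagesPipeline reads the URLs to download from an item field called image_urls, which should hold a list of URLs; an item keyed by an empty string gives the pipeline nothing to download. Change your parse method to something like this: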

def parse(self, response):
    images = []
    for q in response.css("article.img-item .poster-thumb"):
        image = response.urljoin(q.css("::attr(src)").extract_first())
        images.append(image)
    yield {'image_urls': images} 

I just tested this and it works. Additionally, as Pruthvi Kumar commented, IMAGES_STORE should simply be a relative path:

IMAGES_STORE = 'Images'
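
Why the relative value is the safer choice can be sketched with pathlib alone. On Windows, a path that starts with a slash but names no drive is resolved against the root of the current drive, not against the user's home folder. In the sketch below the working directory is an assumption (the project root implied by the path in the question), and resolve_store is a hypothetical helper for illustration, not part of Scrapy.

```python
from pathlib import PureWindowsPath

# Assumption: the crawl is started from the project root.
# This directory is inferred from the question, not confirmed by it.
CWD = PureWindowsPath(r"C:\Users\WCS\Desktop\torrentspider")


def resolve_store(setting: str) -> PureWindowsPath:
    """Hypothetical helper: resolve an IMAGES_STORE value against CWD
    using Windows path rules (a rooted path keeps the current drive
    but discards the rest of the working directory)."""
    return CWD / setting


# The value from the question: rooted, but with no drive letter.
original = resolve_store("/Desktop/torrentspider/torrentspider/spiders/Images")

# The value suggested in the answer: relative to the working directory.
relative = resolve_store("Images")

print(original)  # C:\Desktop\torrentspider\torrentspider\spiders\Images
print(relative)  # C:\Users\WCS\Desktop\torrentspider\Images
```

Under these assumptions the original setting points at a Desktop folder in the root of C:, which is not the folder being checked. With the relative value, the images should end up under an Images folder in whichever directory the crawl is started from; Scrapy's documentation describes the downloaded files as being placed in a full subfolder and named after a SHA-1 hash of the image URL.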
gusridd answered Sep 29 '22 23:09