
How to store Scrapy images on Amazon S3?

I've been using Scrapy for about a week now and want to store the downloaded images on Amazon S3. The docs mention that uploading images to Amazon S3 is supported, but how to do it isn't documented in detail. Does anyone know how to use Amazon S3 with Scrapy?

Here's the Scrapy documentation for the media pipeline.

Mahmoud M. Abdel-Fattah asked May 06 '12

2 Answers

You need 3 settings:

AWS_ACCESS_KEY_ID = "xxxxxx"
AWS_SECRET_ACCESS_KEY = "xxxxxx"
IMAGES_STORE = "s3://bucketname/base-key-dir-if-any/"

That's all. Images will be stored using the same directory structure described at http://readthedocs.org/docs/scrapy/en/latest/topics/images.html#file-system-storage, e.g.:

s3://bucketname/base-key-dir-if-any/full/3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg
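
Note that this assumes the images pipeline is already enabled. If it isn't, a minimal addition to settings.py would look like the sketch below (this uses the scrapy.contrib path from that era; on newer Scrapy versions the pipeline lives at scrapy.pipelines.images.ImagesPipeline instead):

# Enable Scrapy's built-in images pipeline so IMAGES_STORE is used
ITEM_PIPELINES = {
    'scrapy.contrib.pipeline.images.ImagesPipeline': 1,
}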
dangra answered Oct 10 '22

It has been a couple of years since the last answer, and some things have changed (2015). Nick Verwymeren wrote a blog post detailing an updated version of how to do this. His blog post is here: https://www.nickv.codes/blog/scrapy-uploading-image-files-to-amazon-s3/

In your settings.py file:

ITEM_PIPELINES = {
    'scrapy.contrib.pipeline.images.ImagesPipeline': 1
}

# This is the Amazon S3 bucket. You need to use
# the format below so Scrapy can parse it.
# Important: don't forget the trailing slash.
IMAGES_STORE = 's3://my-bucket-name/'

# The number of days before an image is re-downloaded
IMAGES_EXPIRES = 180

# You can add as many of these as you want
IMAGES_THUMBS = {
    'small': (50, 50), 
    'big': (300, 300)
}

AWS_ACCESS_KEY_ID = 'your-access-key'
AWS_SECRET_ACCESS_KEY= 'your-secret-access-key'
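
With IMAGES_THUMBS set, Scrapy writes resized copies alongside the full-size images, so the keys in the bucket end up looking roughly like this (the hash is illustrative):

s3://my-bucket-name/full/069f409fd4cdb02248d726a625fecd8299e6055e.jpg
s3://my-bucket-name/thumbs/small/069f409fd4cdb02248d726a625fecd8299e6055e.jpg
s3://my-bucket-name/thumbs/big/069f409fd4cdb02248d726a625fecd8299e6055e.jpg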

For the sake of security, I suggest creating a new user in the AWS console and giving that user read/write privileges only on your bucket.
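
As a rough sketch, an IAM policy for such a user, limited to a single bucket, might look like this (the bucket name is a placeholder; trim the actions to what you actually need):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-bucket-name",
        "arn:aws:s3:::my-bucket-name/*"
      ]
    }
  ]
}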

Now we need to install a few packages that didn’t come by default with Scrapy:

pip install pillow
pip install botocore

Pillow handles the image manipulation, and botocore provides the library that connects to S3.

Scrapy uses the image_urls key in your item to look for images it should download. This should be a list of image URLs. Once downloaded, Scrapy writes the details of the image location to the images key.

Don’t forget to add these to your items.py file:

import scrapy

class MyItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

Now don’t forget to actually populate the image_urls key during your crawl. Once you crawl your site, the final output will look something like this for a given item:

'image_urls': [u'http://example.com/images/tshirt.jpg'],
'images': [{ 'checksum': '264d3bbdffd4ab3dcb8f234c51329da8',
         'path': 'full/069f409fd4cdb02248d726a625fecd8299e6055e.jpg',
         'url': 'http://example.com/images/tshirt.jpg'}],
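
For reference, here is a minimal spider sketch that fills in image_urls; the domain, start URL, item import path, and CSS selector are all hypothetical:

import scrapy

from myproject.items import MyItem  # hypothetical project/module path


class TshirtSpider(scrapy.Spider):
    name = "tshirts"
    start_urls = ["http://example.com/products"]  # hypothetical listing page

    def parse(self, response):
        item = MyItem()
        # Collect absolute image URLs; the images pipeline downloads each one
        # and stores it under IMAGES_STORE (the S3 bucket configured above).
        item["image_urls"] = [
            response.urljoin(src)
            for src in response.css("img.product::attr(src)").extract()
        ]
        yield item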

Now head on over to your Amazon S3 bucket and have a look. Your images and thumbnails are all there!

Again, a big thanks to Nick Verwymeren for the blog post that answers this question exactly!

Sam Texas answered Oct 10 '22