 

What's the fastest and easiest way to download all the images from a website

Tags: curl, wget

What's the fastest and easiest way to download all the images from a website? More specifically, http://www.cycustom.com/large/.

I'm thinking something along the lines of wget or curl.

To clarify: first (and foremost), I currently do not know how to accomplish this task; second, I'm interested in seeing whether wget or curl offers the easier-to-understand solution. Thanks.

--- UPDATE @sarnold ---

Thank you for responding. I thought that would do the trick too. However, it does not. Here's the command's output:

wget --mirror --no-parent http://www.cycustom.com/large/
--2012-01-10 18:19:36--  http://www.cycustom.com/large/
Resolving www.cycustom.com... 64.244.61.237
Connecting to www.cycustom.com|64.244.61.237|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `www.cycustom.com/large/index.html'

    [ <=> ]  188,795      504K/s   in 0.4s

Last-modified header missing -- time-stamps turned off.
2012-01-10 18:19:37 (504 KB/s) - `www.cycustom.com/large/index.html' saved [188795]

Loading robots.txt; please ignore errors.
--2012-01-10 18:19:37--  http://www.cycustom.com/robots.txt
Connecting to www.cycustom.com|64.244.61.237|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 174 [text/plain]
Saving to: `www.cycustom.com/robots.txt'

100%[==========================>]  174         --.-K/s   in 0s

2012-01-10 18:19:37 (36.6 MB/s) - `www.cycustom.com/robots.txt' saved [174/174]

FINISHED --2012-01-10 18:19:37--
Downloaded: 2 files, 185K in 0.4s (505 KB/s)

Here's a picture of the files created: https://img.skitch.com/20120111-nputrm7hy83r7bct33midhdp6d.jpg

My objective is to have a folder full of image files. The following command did not achieve this objective.

wget --mirror --no-parent http://www.cycustom.com/large/
Asked Jan 11 '12 by John Erck




1 Answer

wget --mirror --no-parent http://www.example.com/large/

The --no-parent prevents it from slurping the entire website.


Ah, I see they have placed a robots.txt asking robots not to download photos from that directory:

$ curl http://www.cycustom.com/robots.txt
User-agent: *
Disallow: /admin/
Disallow: /css/
Disallow: /flash/
Disallow: /large/
Disallow: /pdfs/
Disallow: /scripts/
Disallow: /small/
Disallow: /stats/
Disallow: /temp/
$ 

wget(1) does not document any method to ignore robots.txt and I've never found an easy way to perform the equivalent of --mirror in curl(1). If you wanted to continue using wget(1), then you would need to insert an HTTP proxy in the middle that returns 404 for GET /robots.txt requests.
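
For what it's worth, here is a rough, untested sketch of that proxy idea, using Ruby's WEBrick library (my choice; any proxy that can special-case /robots.txt would do, and the port number is arbitrary). It answers 404 for robots.txt requests and forwards everything else, and wget is then pointed at it through its http_proxy setting:

#!/usr/bin/ruby
# Rough sketch only: a local forwarding proxy that pretends the remote
# site has no robots.txt, so wget's robot exclusion never kicks in.
require 'webrick'
require 'webrick/httpproxy'

class NoRobotsProxy < WEBrick::HTTPProxyServer
  def service(req, res)
    if req.path == '/robots.txt'
      res.status = 404        # claim there is no robots.txt
      res.body   = ''
    else
      super                   # proxy every other request normally
    end
  end
end

NoRobotsProxy.new(:Port => 8080).start

# Then, in another terminal:
#   wget -e use_proxy=yes -e http_proxy=http://127.0.0.1:8080/ \
#        --mirror --no-parent http://www.cycustom.com/large/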

I think it is easier to change approach. Since I wanted more experience using Nokogiri, here's what I came up with:

#!/usr/bin/ruby
require 'open-uri'   # lets open() fetch http:// URLs
require 'nokogiri'   # HTML parsing

# Parse the directory-listing page.
doc = Nokogiri::HTML(open("http://www.cycustom.com/large/"))

# Each image shows up as a link in the listing table.
doc.css('tr > td > a').each do |link|
  name = link['href']
  next unless name.match(/jpg/)   # skip anything that isn't a .jpg link
  File.open(name, "wb") do |out|
    # .read matters here: without it we would write the IO object itself,
    # not the image bytes.
    out.write(open("http://www.cycustom.com/large/" + name).read)
  end
end

This is just a quick and dirty script -- embedding the URL twice is a bit ugly. So if this is intended for long-term production use, clean it up first -- or figure out how to use rsync(1) instead.
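
For example (just a sketch along those lines, not tested against the live site), the base URL could live in one place and URI.join could build each image URL:

#!/usr/bin/ruby
# Same idea as above, with the base URL written only once.
require 'open-uri'
require 'nokogiri'
require 'uri'

base = "http://www.cycustom.com/large/"
doc  = Nokogiri::HTML(open(base))

doc.css('tr > td > a').each do |link|
  name = link['href']
  next unless name =~ /\.jpe?g\z/i          # only image links
  File.open(File.basename(name), "wb") do |out|
    out.write(open(URI.join(base, name)).read)
  end
end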

Answered by sarnold