What's the fastest and easiest way to download all the images from a website? More specifically, http://www.cycustom.com/large/.
I'm thinking something along the lines of wget or curl.
To clarify, first (and foremost), I currently do not know how to accomplish this task. Secondly, I'm interested in seeing whether wget or curl offers an easier-to-understand solution. Thanks.
--- UPDATE @sarnold ---
Thank you for responding. I thought that would do the trick too. However, it does not. Here's the command's output:
wget --mirror --no-parent http://www.cycustom.com/large/
--2012-01-10 18:19:36-- http://www.cycustom.com/large/
Resolving www.cycustom.com... 64.244.61.237
Connecting to www.cycustom.com|64.244.61.237|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `www.cycustom.com/large/index.html'
[ <=> ] 188,795 504K/s in 0.4s
Last-modified header missing -- time-stamps turned off.
2012-01-10 18:19:37 (504 KB/s) - `www.cycustom.com/large/index.html' saved [188795]
Loading robots.txt; please ignore errors.
--2012-01-10 18:19:37-- http://www.cycustom.com/robots.txt
Connecting to www.cycustom.com|64.244.61.237|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 174 [text/plain]
Saving to: `www.cycustom.com/robots.txt'
100%[======================================================================================================================================================================================================================================>] 174 --.-K/s in 0s
2012-01-10 18:19:37 (36.6 MB/s) - `www.cycustom.com/robots.txt' saved [174/174]
FINISHED --2012-01-10 18:19:37--
Downloaded: 2 files, 185K in 0.4s (505 KB/s)
Here's a picture of the files created: https://img.skitch.com/20120111-nputrm7hy83r7bct33midhdp6d.jpg
My objective is to end up with a folder full of image files. The following command did not achieve this objective:
wget --mirror --no-parent http://www.cycustom.com/large/
Alternatively, you can grab all the images on a web page at once with a Chrome image-downloader extension: the extension adds an icon to the Chrome toolbar; open the page containing the images, click the icon, then tick the images you want (or select all) and download them.
wget --mirror --no-parent http://www.example.com/large/
The --no-parent prevents it from slurping the entire website.
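If the goal is a single folder of just the image files, wget's --no-directories and --accept options can narrow the mirror; here's a sketch against the same placeholder URL, assuming the pictures end in .jpg/.jpeg/.png/.gif:

wget --mirror --no-parent --no-directories --accept jpg,jpeg,png,gif http://www.example.com/large/

wget still fetches the index page to discover the links, but with an accept list it removes non-matching HTML files after parsing them.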
Ahh, I see they have placed a robots.txt asking robots not to download photos from that directory:
$ curl http://www.cycustom.com/robots.txt
User-agent: *
Disallow: /admin/
Disallow: /css/
Disallow: /flash/
Disallow: /large/
Disallow: /pdfs/
Disallow: /scripts/
Disallow: /small/
Disallow: /stats/
Disallow: /temp/
$
wget(1) does not document any method to ignore robots.txt, and I've never found an easy way to perform the equivalent of --mirror in curl(1). If you wanted to continue using wget(1), then you would need to insert an HTTP proxy in the middle that returns 404 for GET /robots.txt requests.
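For what it's worth, here's a rough sketch of such a proxy in Ruby using WEBrick's HTTPProxyServer (part of the standard library on Ruby versions of this era); the class name and port are arbitrary, and the sketch is untested against this particular site:

#!/usr/bin/ruby
# Minimal sketch of a local proxy that answers 404 for any robots.txt request
# and forwards everything else untouched.
require 'webrick'
require 'webrick/httpproxy'

class NoRobotsProxy < WEBrick::HTTPProxyServer
  def service(req, res)
    if req.path == '/robots.txt'
      # Pretend the file does not exist so the crawler carries on.
      res.status = 404
      res.body   = 'not found'
    else
      super  # normal proxying for every other request
    end
  end
end

NoRobotsProxy.new(:Port => 8080).start

You would then point wget at it, e.g. http_proxy=http://localhost:8080 wget --mirror --no-parent http://www.cycustom.com/large/.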
I think it is easier to take a different approach entirely. Since I wanted more experience using Nokogiri, here's what I came up with:
#!/usr/bin/ruby
require 'open-uri'
require 'nokogiri'

# Fetch and parse the directory listing.
doc = Nokogiri::HTML(open("http://www.cycustom.com/large/"))

# Every image shows up as a link inside a table cell of the listing.
doc.css('tr > td > a').each do |link|
  name = link['href']
  next unless name.match(/jpg/)
  # Save each image under its original filename.
  File.open(name, "wb") do |out|
    out.write(open("http://www.cycustom.com/large/" + name).read)
  end
end
This is just a quick and dirty script -- embedding the URL twice is a bit ugly. So if this is intended for long-term production use, clean it up first -- or figure out how to use rsync(1) instead.
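For what it's worth, a sketch of one such cleanup, keeping the URL in a single constant and resolving the links with URI.join from the standard library (the BASE name is mine):

#!/usr/bin/ruby
require 'open-uri'
require 'uri'
require 'nokogiri'

# BASE is introduced here purely for illustration.
BASE = "http://www.cycustom.com/large/"

doc = Nokogiri::HTML(open(BASE))
doc.css('tr > td > a').each do |link|
  name = link['href']
  next unless name.match(/jpg/)
  File.open(name, "wb") do |out|
    # URI.join resolves the relative href against BASE.
    out.write(open(URI.join(BASE, name).to_s).read)
  end
end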