What's the fastest and easiest way to download all the images from a website? More specifically, http://www.cycustom.com/large/.
I'm thinking something along the lines of wget or curl.
To clarify, first (and foremost), I currently do not know how to accomplish this task. Secondly, I'm interested in seeing whether wget or curl offers an easier-to-understand solution. Thanks.
--- UPDATE @sarnold ---
Thank you for responding. I thought that would do the trick too. However, it does not. Here's the command's output:
wget --mirror --no-parent http://www.cycustom.com/large/
--2012-01-10 18:19:36-- http://www.cycustom.com/large/
Resolving www.cycustom.com... 64.244.61.237
Connecting to www.cycustom.com|64.244.61.237|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `www.cycustom.com/large/index.html'
[ <=> ] 188,795 504K/s in 0.4s
Last-modified header missing -- time-stamps turned off.
2012-01-10 18:19:37 (504 KB/s) - `www.cycustom.com/large/index.html' saved [188795]
Loading robots.txt; please ignore errors.
--2012-01-10 18:19:37-- http://www.cycustom.com/robots.txt
Connecting to www.cycustom.com|64.244.61.237|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 174 [text/plain]
Saving to: `www.cycustom.com/robots.txt'
100%[======================================================================================================================================================================================================================================>] 174 --.-K/s in 0s
2012-01-10 18:19:37 (36.6 MB/s) - `www.cycustom.com/robots.txt' saved [174/174]
FINISHED --2012-01-10 18:19:37--
Downloaded: 2 files, 185K in 0.4s (505 KB/s)
Here's a picture of the files created: https://img.skitch.com/20120111-nputrm7hy83r7bct33midhdp6d.jpg
My objective is to end up with a folder full of image files. The following command did not achieve this objective:
wget --mirror --no-parent http://www.cycustom.com/large/
Alternatively, you can grab all the images on a web page at once with a Chrome image-downloader extension: the extension adds an icon to the Chrome toolbar; open the page containing the images, click the icon, then tick the images you want (or select all) and download them.
wget --mirror --no-parent http://www.example.com/large/
The --no-parent prevents it from slurping the entire website.
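If the goal is a single folder of just the image files, wget's --no-directories and --accept options can narrow the mirror; here's a sketch against the same placeholder URL, assuming the pictures end in .jpg/.jpeg/.png/.gif:

wget --mirror --no-parent --no-directories --accept jpg,jpeg,png,gif http://www.example.com/large/

wget still fetches the index page to discover the links, but with an accept list it removes non-matching HTML files after parsing them.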
Ahh, I see they have placed a robots.txt asking robots not to download photos from that directory:
$ curl http://www.cycustom.com/robots.txt
User-agent: *
Disallow: /admin/
Disallow: /css/
Disallow: /flash/
Disallow: /large/
Disallow: /pdfs/
Disallow: /scripts/
Disallow: /small/
Disallow: /stats/
Disallow: /temp/
$
wget(1) does not document any method to ignore robots.txt, and I've never found an easy way to perform the equivalent of --mirror in curl(1). If you wanted to continue using wget(1), then you would need to insert an HTTP proxy in the middle that returns 404 for GET /robots.txt requests.
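For what it's worth, here's a rough sketch of such a proxy in Ruby using WEBrick's HTTPProxyServer (part of the standard library on Ruby versions of this era); the class name and port are arbitrary, and the sketch is untested against this particular site:

#!/usr/bin/ruby
# Minimal sketch of a local proxy that answers 404 for any robots.txt request
# and forwards everything else untouched.
require 'webrick'
require 'webrick/httpproxy'

class NoRobotsProxy < WEBrick::HTTPProxyServer
  def service(req, res)
    if req.path == '/robots.txt'
      # Pretend the file does not exist so the crawler carries on.
      res.status = 404
      res.body   = 'not found'
    else
      super  # normal proxying for every other request
    end
  end
end

NoRobotsProxy.new(:Port => 8080).start

You would then point wget at it, e.g. http_proxy=http://localhost:8080 wget --mirror --no-parent http://www.cycustom.com/large/.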
I think it is easier to take a different approach entirely. Since I wanted more experience using Nokogiri, here's what I came up with:
#!/usr/bin/ruby
require 'open-uri'
require 'nokogiri'

# Fetch and parse the directory listing.
doc = Nokogiri::HTML(open("http://www.cycustom.com/large/"))

# Every image shows up as a link inside a table cell of the listing.
doc.css('tr > td > a').each do |link|
  name = link['href']
  next unless name.match(/jpg/)
  # Save each image under its original filename.
  File.open(name, "wb") do |out|
    out.write(open("http://www.cycustom.com/large/" + name).read)
  end
end
This is just a quick and dirty script -- embedding the URL twice is a bit ugly. So if this is intended for long-term production use, clean it up first -- or figure out how to use rsync(1) instead.
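For what it's worth, a sketch of one such cleanup, keeping the URL in a single constant and resolving the links with URI.join from the standard library (the BASE name is mine):

#!/usr/bin/ruby
require 'open-uri'
require 'uri'
require 'nokogiri'

# BASE is introduced here purely for illustration.
BASE = "http://www.cycustom.com/large/"

doc = Nokogiri::HTML(open(BASE))
doc.css('tr > td > a').each do |link|
  name = link['href']
  next unless name.match(/jpg/)
  File.open(name, "wb") do |out|
    # URI.join resolves the relative href against BASE.
    out.write(open(URI.join(BASE, name).to_s).read)
  end
end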