How to download a full website?

After fixing the code of a website to use a CDN (rewriting all the URLs for images, JS & CSS), I need to test all the pages on the domain to make sure all the resources are fetched from the CDN.

All the site's pages are accessible through links; there are no isolated pages.

Currently I'm using Firebug and checking the "Net" view...

Is there some automated way to give a domain name and request all pages + resources of the domain?

Update:

OK, I found I can use wget like so:

wget -p --no-cache -e robots=off -m -H -D cdn.domain.com,www.domain.com -o site1.log www.domain.com

options explained:

  • -p - download page requisites too (images, CSS, JS, etc.)
  • --no-cache - get the real object, not a server-cached copy
  • -e robots=off - disregard robots.txt and nofollow directives
  • -m - mirror the site (follow links)
  • -H - span hosts (follow other domains too)
  • -D cdn.domain.com,www.domain.com - specify which domains to follow, otherwise wget will follow every link on the page
  • -o site1.log - log to the file site1.log (a sketch for scanning this log follows the list)
  • -U "Mozilla/5.0" - optional: fake the user agent - useful if the server returns different data for different browsers
  • www.domain.com - the site to download
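To verify that everything was actually served from the CDN, you can scan the site1.log file written by -o for any fetched URL whose host is neither the CDN nor the site itself. A minimal sketch in Python, assuming wget's default log format (each requested URL appears somewhere on a log line) and the placeholder domain names from the command above:

    import re
    import sys

    # Hosts allowed to appear in the log (assumption: the domains from the wget command above).
    # The pages themselves live on www.domain.com, so that host is allowed; assets should be on the CDN.
    ALLOWED = ("cdn.domain.com", "www.domain.com")

    URL_RE = re.compile(r"https?://([^/\s]+)\S*")

    def non_allowed_requests(log_path):
        """Yield every URL in the wget log whose host is not in ALLOWED."""
        with open(log_path, encoding="utf-8", errors="replace") as log:
            for line in log:
                for match in URL_RE.finditer(line):
                    if match.group(1) not in ALLOWED:
                        yield match.group(0)

    if __name__ == "__main__":
        log_file = sys.argv[1] if len(sys.argv) > 1 else "site1.log"
        for url in sorted(set(non_allowed_requests(log_file))):
            print(url)

Anything the script prints is a request that went somewhere other than the CDN or the site itself.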

Enjoy!

asked Oct 23 '12 by SimonW



1 Answer

The wget documentation has this bit in it:

Actually, to download a single page and all its requisites (even if they exist on separate websites), and make sure the lot displays properly locally, this author likes to use a few options in addition to ‘-p’:

      wget -E -H -k -K -p http://site/document

The key is the -H option, which stands for --span-hosts -> go to foreign hosts when recursive. I don't know whether this also applies to normal hyperlinks or only to page requisites, but you should try it out.

You can consider an alternate strategy. You don't need to download the resources to test that they are referenced from the CDN. You can just get the source code for the pages you're interested in (you can use wget, as you did, or curl, or something else) and either:

  • parse it using a library - which one depends on the language you're using for scripting. Check each <img />, <link /> and <script /> for CDN links (a sketch follows this list).
  • use regexes to check that the resource URLs contain the CDN domain. The usual caveats about parsing HTML with regexes apply, although in this limited case it might not be overly complicated.
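For the library option, here's a minimal sketch in Python using the standard-library html.parser. cdn.domain.com is an assumption carried over from the question, and relative URLs are flagged too, since they would be served by the origin host rather than the CDN:

    from html.parser import HTMLParser
    from urllib.parse import urlparse

    CDN_HOST = "cdn.domain.com"  # assumption: the CDN host from the question

    class CDNChecker(HTMLParser):
        """Collect <img>, <script> and <link rel="stylesheet"> URLs that do not point at the CDN."""

        def __init__(self):
            super().__init__()
            self.offenders = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            url = None
            if tag in ("img", "script"):
                url = attrs.get("src")
            elif tag == "link" and (attrs.get("rel") or "").lower() == "stylesheet":
                url = attrs.get("href")
            # An empty netloc means a relative URL, i.e. served by the origin host, not the CDN.
            if url is not None and urlparse(url).netloc != CDN_HOST:
                self.offenders.append((tag, url))

    def check_page(html_text):
        checker = CDNChecker()
        checker.feed(html_text)
        return checker.offenders

    if __name__ == "__main__":
        import sys
        with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
            for tag, url in check_page(f.read()):
                print(f"not on CDN: <{tag}> {url}")

You can run this over every HTML file in the mirror wget created, or over pages fetched live with curl.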

You should also check all CSS files for url() references - they should also point to CDN images (a similar sketch follows). Depending on the logic of your application, you may need to check that the JavaScript code does not create any images that do not come from the CDN.
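In the same spirit, a small sketch that pulls url(...) references out of CSS files and reports the ones that do not point at the CDN (again assuming cdn.domain.com; data: URIs are skipped because they never hit a server):

    import re
    import sys
    from urllib.parse import urlparse

    CDN_HOST = "cdn.domain.com"  # assumption: the CDN host from the question

    # Match url(...) with optional quotes, as used for background images, fonts, etc.
    CSS_URL_RE = re.compile(r"""url\(\s*['"]?([^'")\s]+)['"]?\s*\)""", re.IGNORECASE)

    def non_cdn_css_urls(css_text):
        """Return url() references whose host is not the CDN (relative ones count too)."""
        offenders = []
        for url in CSS_URL_RE.findall(css_text):
            if url.startswith("data:"):
                continue
            if urlparse(url).netloc != CDN_HOST:
                offenders.append(url)
        return offenders

    if __name__ == "__main__":
        for path in sys.argv[1:]:
            with open(path, encoding="utf-8", errors="replace") as f:
                for url in non_cdn_css_urls(f.read()):
                    print(f"{path}: not on CDN: {url}")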

answered Sep 20 '22 by Alex Ciminian