 

What's the best way to save a complete webpage on a linux server?

I need to archive complete pages including any linked images etc. on my linux server. Looking for the best solution. Is there a way to save all assets and then relink them all to work in the same directory?

I've thought about using curl, but I'm unsure of how to do all of this. Also, will I maybe need PHP-DOM?

Is there a way to use firefox on the server and copy the temp files after the address has been loaded or similar?

Any and all input welcome.

Edit:

It seems as though wget is not going to work, as the pages need to be rendered first. I have Firefox installed on the server; is there a way to load the URL in Firefox, grab the temp files, and then clear them afterwards?

Tomas asked Jan 22 '11 at 17:01

People also ask

How can I save an entire Web page?

In Chrome, you can right-click anywhere on the page and select Save as, or use the keyboard shortcut Ctrl+S on Windows or Command+S on macOS. Chrome can save the complete web page, including text and media assets, or just the HTML text.

How do I download an entire website in Ubuntu?

httrack is the tool you are looking for. HTTrack allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer. HTTrack arranges the original site's relative link-structure.
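A minimal HTTrack invocation along those lines might look like this sketch; the URL, output directory, and filter pattern are placeholders to adapt.

```shell
# Sketch: mirror a site into a local directory with HTTrack.
# -O sets the output directory; the trailing "+..." filter keeps
# the crawl on the target domain.
mirror_site() {
    httrack "http://example.com/" -O "/var/archive/example" "+*.example.com/*"
}

# Uncomment to run:
# mirror_site
```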

How do I download HTML in Linux?

Wget. Wget is probably the most famous of all the downloading options. It can download from HTTP, HTTPS, and FTP servers, fetch an entire website, and work through a proxy.


2 Answers

wget can do that, for example:

wget -r http://example.com/

This will mirror the whole example.com site.

Some interesting options are:

-Dexample.com: restrict the crawl to the listed domain(s); this takes effect together with -H, which allows recursion to span to other hosts
--html-extension: renames pages served with a text/html content type to end in .html

Manual: http://www.gnu.org/software/wget/manual/
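Putting those options together, a mirror run might look like the sketch below. The URL, domain, and the function name are placeholders; adjust them for the site you actually want to archive.

```shell
# Sketch: mirror example.com recursively, allowing host spanning (-H)
# but restricting it to the example.com domain (-D), and renaming
# text/html pages to .html (--html-extension).
mirror_example() {
    wget -r -H -Dexample.com --html-extension http://example.com/
}

# Uncomment to run:
# mirror_example
```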

Arnaud Le Blanc answered Nov 15 '22 at 21:11


Use the following command:

wget -E -k -p http://yoursite.com

Use -E to adjust extensions, -k to convert links so the page loads from your local copy, and -p to download all objects embedded in the page.

Please note that this command does not download other pages hyperlinked from the specified page; it only downloads the objects required to render that page properly.
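If you also want the pages linked from the start page, the same idea extends with -r plus a depth limit. A hedged sketch, with yoursite.com and the function name as placeholders:

```shell
# Sketch: fetch the page plus everything one link away (-r -l 1),
# keeping -E/-k/-p so the local copy still renders offline.
archive_with_links() {
    wget -E -k -p -r -l 1 http://yoursite.com/
}

# Uncomment to run:
# archive_with_links
```

Raising the -l value crawls deeper, at the cost of downloading much more.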

SuB answered Nov 15 '22 at 21:11