I need to find a way to crawl one of our company's web applications and create a static site from it that can be burned to a CD and used by traveling salespeople to demo the web site. The back-end data store is spread across many, many systems, so simply running the site on a VM on the salesperson's laptop won't work. And they won't have access to the internet while at some clients (no internet, no cell coverage... primitive, I know).
Does anyone have any good recommendations for crawlers that can handle things like link cleanup, Flash, a little Ajax, CSS, etc.? I know the odds are slim, but I figured I'd throw the question out here before I jump into writing my own tool.
You could use a web crawler, e.g. one of these:
- DataparkSearch is a crawler and search engine released under the GNU General Public License.
- GNU Wget is a command-line operated crawler written in C and released under the GPL. It is typically used to mirror web and FTP sites.
- HTTrack uses a Web crawler to create a mirror of a web site for off-line viewing (see the example command after this list). It is written in C and released under the GPL.
- ICDL Crawler is a cross-platform web crawler written in C++ and intended to crawl websites based on Website Parse Templates, using the computer's free CPU resources only.
- JSpider is a highly configurable and customizable web spider engine released under the GPL.
- Larbin by Sebastien Ailleret
- Webtools4larbin by Andreas Beder
- Methabot is a speed-optimized web crawler and command line utility written in C and released under a 2-clause BSD License. It features a wide configuration system, a module system and has support for targeted crawling through local filesystem, HTTP or FTP.
- Jaeksoft WebSearch is a web crawler and indexer built on Apache Lucene. It is released under the GPL v3 license.
- Nutch is a crawler written in Java and released under an Apache License. It can be used in conjunction with the Lucene text indexing package.
- Pavuk is a command-line web mirror tool with an optional X11 GUI crawler, released under the GPL. It has a bunch of advanced features compared to wget and HTTrack, e.g. regular-expression-based filtering and file-creation rules.
- WebVac is a crawler used by the Stanford WebBase Project.
- WebSPHINX (Miller and Bharat, 1998) is composed of a Java class library that implements multi-threaded web page retrieval and HTML parsing, and a graphical user interface to set the starting URLs, to extract the downloaded data and to implement a basic text-based search engine.
- WIRE - Web Information Retrieval Environment is a web crawler written in C++ and released under the GPL. It includes several policies for scheduling page downloads and a module for generating reports and statistics on the downloaded pages, so it has been used for web characterization.
- LWP::RobotUA (Langheinrich, 2004) is a Perl class for implementing well-behaved parallel web robots, distributed under Perl 5's license.
- Web Crawler is an open-source web crawler class for .NET (written in C#).
- Sherlock Holmes gathers and indexes textual data (text files, web pages, ...), both locally and over the network. Holmes is sponsored and commercially used by the Czech web portal Centrum. It is also used by Onet.pl.
- YaCy, a free distributed search engine, built on principles of peer-to-peer networks (licensed under GPL).
- Ruya is an open-source, high-performance, breadth-first, level-based web crawler. It is used to crawl English and Japanese websites in a well-behaved manner. It is released under the GPL and is written entirely in Python. A SingleDomainDelayCrawler implementation obeys robots.txt with a crawl delay.
- Universal Information Crawler is a fast-developing web crawler that crawls, saves, and analyzes the data.
- Agent Kernel is a Java framework for scheduling, threading, and storage management when crawling.
- Spider News provides information on building a spider in Perl.
- Arachnode.NET is an open-source promiscuous Web crawler for downloading, indexing and storing Internet content including e-mail addresses, files, hyperlinks, images, and Web pages. Arachnode.net is written in C# using SQL Server 2005 and is released under the GPL.
- dine is a multithreaded Java HTTP client/crawler that can be programmed in JavaScript, released under the LGPL.
- Crawljax is an Ajax crawler based on a method which dynamically builds a `state-flow graph' modeling the various navigation paths and states within an Ajax application. Crawljax is written in Java and released under the BSD License.
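Since HTTrack is probably the closest fit for "mirror a site so it can be browsed offline", here is a minimal sketch of the kind of invocation you might start from (the URL, output directory, filter, and depth are placeholders; check the httrack documentation for your version):

httrack "http://intranet.example.com/app/" -O ./mirror "+*.intranet.example.com/*" -r6 --robots=0

The +*.intranet.example.com/* filter keeps the crawl on your own site, -r6 caps the recursion depth, and --robots=0 ignores robots.txt, which is usually what you want on an internal app.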
Since nobody copy-pasted a working command... I am trying... ten years later. :D
wget --mirror --convert-links --adjust-extension --page-requisites \
--no-parent http://example.org
It worked like a charm for me.
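Since the question mentions burning the result to a CD for salespeople's laptops, it may also be worth adding --restrict-file-names=windows, which rewrites characters like ? and : that are legal in URLs but not in Windows filenames, plus -e robots=off when mirroring your own internal app. Treat this as a variation on the command above, not gospel:

wget --mirror --convert-links --adjust-extension --page-requisites \
     --no-parent --restrict-file-names=windows -e robots=off http://example.org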
wget can recursively follow links and mirror an entire site (curl on its own can't crawl recursively), so that might be a good bet. You won't be able to use the truly interactive parts of the site, like search engines, or anything that modifies the data, though.
Is it possible at all to create dummy backend services that can run from the sales folks' laptops, that the app can interface with?
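If that route is viable, one rough sketch (every hostname, path, and port here is made up for illustration) is to capture the JSON the app's AJAX calls return while you're on the office network, then serve those canned files from a tiny local web server on the laptop:

# capture a real response once, while connected to the intranet (hypothetical URL)
mkdir -p stubs/api
curl -s http://intranet.example.com/api/products > stubs/api/products

# on the laptop, serve the canned data so the front end's AJAX calls still resolve
cd stubs && python3 -m http.server 8080

Whether that works depends entirely on how parameterized the requests are; if every call is unique, static captures won't cut it.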
You're not going to be able to handle things like AJAX requests without burning a webserver to the CD, which I understand you have already said is impossible.
wget will download the site for you (use the -r parameter for "recursive"), but any dynamic content like reports and so on will of course not work properly; you'll just get a single snapshot.
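If those dynamic sections just produce noise in the mirror, wget can also be told to stay out of them; for example (the directory names are placeholders for whatever your app actually uses):

wget -r -l 5 --convert-links --page-requisites \
     --exclude-directories=/reports,/search http://example.org

Here -l caps the recursion depth and --exclude-directories keeps the crawler out of paths that only make sense with a live back end.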