There are a number of tools on the internet for downloading a static copy of a website, such as HTTrack. There are also many tools, some commercial, for “scraping” content from a website, such as Mozenda. Then there are tools which are apparently built into environments like PHP and *nix, where you can use “file_get_contents” or “wget” or “cURL” or just “file()”.
I am thoroughly confused by all of this, and I think the main reason is that none of the descriptions I have come across use the same vocabulary. On the surface, at least, it seems like they are all doing the same thing, but maybe not.
That is my question. What are these tools doing, exactly? Are they doing the same thing? Are they doing the same thing via different technology? If they aren’t doing the same thing, how are they different?
First, let me clarify the difference between "mirroring" and "scraping".
Mirroring refers to downloading the entire contents of a website, or some prominent section(s) of it (including HTML, images, scripts, CSS stylesheets, etc). This is often done to preserve and expand access to a valuable (and often limited) internet resource, or to add additional fail-over redundancy. For example, many universities and IT companies mirror various Linux vendors' release archives. Mirroring may imply that you plan on hosting a copy of the website on your own server (with the original content owner's permission).
Scraping refers to copying and extracting some interesting data from a website. Unlike mirroring, scraping targets a particular dataset (names, phone numbers, stock quotes, etc) rather than the entire contents of the site. For example, you could "scrape" average income data from the US Census Bureau or stock quotes from Google Finance. This is sometimes done against the host's terms and conditions, which can expose you to legal risk, so check them first.
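To make "targeting a particular dataset" concrete, here is a minimal sketch using only Python's standard library. The page layout (a two-column table of symbols and prices), the `SAMPLE_PAGE` markup, and the names `TableCellScraper` and `scrape_quotes` are all made up for illustration; a real site will need its own parsing rules.

```python
from html.parser import HTMLParser


class TableCellScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being parsed, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a <td>.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip header rows, which contain no <td> cells
                self.rows.append(self._row)
            self._row = None


def scrape_quotes(html):
    """Return {symbol: price} from a hypothetical two-column quotes table."""
    parser = TableCellScraper()
    parser.feed(html)
    parser.close()
    return {row[0]: float(row[1]) for row in parser.rows if len(row) == 2}


# Hypothetical page content, standing in for a downloaded HTML document.
SAMPLE_PAGE = """
<html><body>
  <h1>Quotes</h1>
  <table>
    <tr><th>Symbol</th><th>Price</th></tr>
    <tr><td>AAA</td><td>12.50</td></tr>
    <tr><td>BBB</td><td>7.25</td></tr>
  </table>
</body></html>
"""
```

Calling `scrape_quotes(SAMPLE_PAGE)` keeps only the symbol/price pairs and ignores everything else on the page, which is the essential difference from mirroring.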
The two can be combined in order to separate data copying (mirroring) from information extraction (scraping) concerns. For example, you may find that it's quicker to mirror a site, and then scrape your local copy if the extraction and analysis of the data is slow or process-intensive.
To answer the rest of your question...
The file_get_contents() and file() PHP functions are for reading a file from a local or remote machine. The file may be an HTML file, or it could be something else, like a text file or a spreadsheet. This is not what either "mirroring" or "scraping" usually refers to, although you could write your own PHP-based mirror/scraper using these.
wget and curl are stand-alone command-line programs for downloading one or more files from remote servers, using a variety of options, conditions and protocols. Both are incredibly powerful and popular tools, the main difference being that wget has rich built-in features for mirroring entire websites.
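As a rough illustration of those mirroring features, the sketch below builds a wget command line from Python and runs it with `subprocess`. The flags are standard GNU wget options; the function names, the one-second wait, and the example URL in the comment are my own assumptions, and it only works if wget is installed and on your PATH.

```python
import subprocess


def build_wget_mirror_command(url, dest_dir):
    """Return the argument list for mirroring `url` into `dest_dir` with GNU wget."""
    return [
        "wget",
        "--mirror",            # recursive download with timestamping
        "--page-requisites",   # also fetch images, CSS and scripts the pages need
        "--convert-links",     # rewrite links so the local copy can be browsed offline
        "--adjust-extension",  # save HTML pages with an .html suffix
        "--no-parent",         # never climb above the starting directory
        "--wait=1",            # pause between requests (an arbitrary, polite default)
        f"--directory-prefix={dest_dir}",
        url,
    ]


def mirror_site(url, dest_dir):
    """Run wget; raises CalledProcessError if wget exits with an error."""
    subprocess.run(build_wget_mirror_command(url, dest_dir), check=True)


# Example (not run here, since it needs wget and network access):
# mirror_site("https://example.com/docs/", "mirror")
```

curl has no equivalent recursive mode, so with curl you would have to discover and request each URL yourself.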
HTTrack is similar to wget in its intent, but is usually used through its GUI rather than from the command line. This makes it easier to use for those not comfortable running commands from a terminal, at the cost of losing some of the power and flexibility provided by wget.
You can use HTTrack and wget for mirroring, but you will have to run your own programs on the resulting downloaded data to extract (scrape) information, if that's your ultimate goal.
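A minimal sketch of such a program, assuming the mirror is a directory tree of .html files on disk: it walks the tree and pulls out each page's title. The names `TitleParser` and `scrape_titles` are hypothetical, and the title is just a stand-in for whatever data you actually want to extract.

```python
from html.parser import HTMLParser
from pathlib import Path


class TitleParser(HTMLParser):
    """Capture the text inside the <title> element of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scrape_titles(mirror_dir):
    """Return {relative path: page title} for every .html file under mirror_dir."""
    root = Path(mirror_dir)
    titles = {}
    for page in sorted(root.rglob("*.html")):
        parser = TitleParser()
        # errors="replace" keeps going if a page is not valid UTF-8.
        parser.feed(page.read_text(encoding="utf-8", errors="replace"))
        parser.close()
        titles[page.relative_to(root).as_posix()] = parser.title.strip()
    return titles
```

Because this reads only local files, you can rerun and refine the extraction as often as you like without sending any further requests to the original site.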
Mozenda is a scraper, which, unlike HTTrack, wget or curl, allows you to target specific data to be extracted, rather than blindly copying all contents. I have little experience with it, however.
P.S. I usually use wget to mirror the HTML pages I'm interested in, and then run a combination of Ruby and R scripts to extract and analyze data.