For a list of websites, I want to get the pages indexed by year, if they were archived at any point that year. So if I'm looking at example1.com and example2.com, I want to be able to get:
2010: example1.com, example2.com (the html from these archived pages)
2011: example1.com (example2.com, say, was not archived in 2011)
2012: example2.com
2013: example1.com, example2.com
and so on.
Is this possible to get using the Wayback Machine API? I looked at their API listing and it didn't seem like I could do what I was trying to do. Maybe I'm missing something, but it seems like a fairly plausible use case. Any other suggestions?
The key thing to understand about the Wayback Machine APIs is that there are (from what I can tell) three different ways to work with them.
The first is the API which is documented near the top of the Wayback Machine API page you already mentioned.
That API gives the date-wise nearest result for an archive on a given page. So you can check the Wayback Machine for copies of the Google homepage archived around New Year's Day like so:
http://archive.org/wayback/available?url=google.com&timestamp=20080101
http://archive.org/wayback/available?url=google.com&timestamp=20090101
http://archive.org/wayback/available?url=google.com&timestamp=20100101
etc.
Using the information returned in those URLs, you can easily download the content programmatically.
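As a rough illustration of that lookup, here is a minimal Python sketch. The helper names (availability_url, closest_snapshot, fetch_closest) are my own and not part of any official client, and the JSON shape assumed here (an "archived_snapshots" object containing a "closest" entry with "available", "url" and "timestamp" fields) is what the availability endpoint documents, so treat it as an assumption to verify against a live response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "http://archive.org/wayback/available"


def availability_url(url, timestamp):
    """Build the query URL for the capture nearest to `timestamp` (YYYYMMDD...)."""
    return API + "?" + urlencode({"url": url, "timestamp": timestamp})


def closest_snapshot(payload):
    """Return (timestamp, snapshot_url) from a decoded response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


def fetch_closest(url, timestamp):
    """Hypothetical helper: one network call, then parse the JSON body."""
    with urlopen(availability_url(url, timestamp), timeout=30) as resp:
        return closest_snapshot(json.load(resp))
```

Note that this endpoint returns the nearest capture, which may fall in a different year than the one requested. For a per-year index, compare the first four digits of the returned timestamp with the year you asked for before counting it.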
Next we have the Wayback Machine CDX Server API, which exposes a much richer set of interfaces. Most notably, you can quickly list every capture of a URL that you are interested in:
http://web.archive.org/cdx/search/cdx?url=www.fredtrotter.com
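A small Python sketch of querying that endpoint follows. It assumes the output=json option, which (as I understand the CDX server documentation) returns a list of rows whose first row is the header of field names such as timestamp and original. The function names are hypothetical, and optional parameters like collapse and filter are passed straight through, so check them against the CDX documentation before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX = "http://web.archive.org/cdx/search/cdx"


def cdx_url(url, **params):
    """Build a CDX query URL; extra keyword arguments become query parameters."""
    query = {"url": url, "output": "json"}
    query.update(params)
    return CDX + "?" + urlencode(query)


def parse_cdx_json(rows):
    """Turn [header, row, row, ...] into a list of dicts keyed by field name."""
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


def fetch_captures(url, **params):
    """Hypothetical helper: one network call returning parsed capture records."""
    with urlopen(cdx_url(url, **params), timeout=60) as resp:
        body = resp.read().decode("utf-8")
    return parse_cdx_json(json.loads(body) if body.strip() else [])
```

For a per-year view, something like fetch_captures("www.fredtrotter.com", collapse="timestamp:4", filter="statuscode:200") should return roughly one successful capture per year, assuming collapse behaves as documented (it keeps one row per distinct timestamp prefix).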
Lastly we have the deep and mysterious resource that is the Wayback Machine Memento API. That link is to a blog post about the functionality, but from what I can gather, this is about working with the Wayback Machine at a protocol level, where the Memento Protocol is a well-thought-out version of the way an archive site should operate.
In all cases, please be gentle and respectful with your scripting. The Wayback Machine API does not currently require credentials, which is a very generous and open posture, in keeping with the Internet Archive's role as a "Wonder of the Virtual World". So do not abuse it, because that is how we ensure that we have nice things.
Thanks to Greg, and the rest of the Wayback Machine team, for the excellent work you do to keep the Internet a source of personal freedom and expression.
Our CDX API allows you to make two separate calls to get a list of all captures: one for the URL or domain example1.com and one for the URL or domain example2.com. You can then produce whatever summary you like.
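To make the "produce whatever summary you like" step concrete, here is a minimal Python sketch that groups captures by year. It assumes you have already collected (timestamp, original URL) pairs per site from the CDX API, with 14-digit timestamps of the form YYYYMMDDhhmmss. The function name index_by_year and the sample data in the usage note are made up for illustration.

```python
from collections import defaultdict


def index_by_year(captures_by_site):
    """Map year -> {site: snapshot URL of that site's earliest capture that year}.

    `captures_by_site` maps a site name to a list of (timestamp, original_url)
    pairs, where timestamp is a 14-digit string (YYYYMMDDhhmmss).
    """
    by_year = defaultdict(dict)
    for site, captures in captures_by_site.items():
        # Sorting puts the earliest capture of each year first.
        for timestamp, original in sorted(captures):
            year = timestamp[:4]
            snapshot = f"http://web.archive.org/web/{timestamp}/{original}"
            # setdefault keeps the first (earliest) capture per site and year.
            by_year[year].setdefault(site, snapshot)
    return {year: by_year[year] for year in sorted(by_year)}
```

With invented input such as {"example1.com": [("20100305120000", "http://example1.com/"), ("20110101000000", "http://example1.com/")], "example2.com": [("20100610080000", "http://example2.com/")]}, the result has both sites under "2010" and only example1.com under "2011". Each value is a snapshot URL you can then download to get the archived HTML.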