
How to access Wayback Machine programmatically?

Tags:

web-scraping

What I'm trying to do

For a list of websites, I want to get the pages indexed by year, if they were archived at any point that year. So if I'm looking at example1.com and example2.com, I want to be able to get:

2010: example1.com, example2.com (the html from these archived pages)
2011: example1.com (example2.com, say, was not archived in 2011)
2012: example2.com
2013: example1.com, example2.com

and so on.

Question

Is this possible to get using the Wayback Machine API? I looked at their API listing and it didn't seem like I could do what I was trying to do. Maybe I'm missing something, but it seems like a fairly plausible use case. Any other suggestions?

ShivanKaul asked Nov 19 '15 18:11


2 Answers

The key thing to understand about the Wayback Machine APIs is that there are (from what I can tell) three different ways to work with them.

Wayback Availability JSON API

The first is the API which is documented near the top of the Wayback Machine API page you already mentioned.

That API returns the archived snapshot of a given URL that is closest in time to the timestamp you supply. So you can check the Wayback Machine for copies of the Google homepage archived around New Year's Day like so:

http://archive.org/wayback/available?url=google.com&timestamp=20080101
http://archive.org/wayback/available?url=google.com&timestamp=20090101
http://archive.org/wayback/available?url=google.com&timestamp=20100101
etc.

Using the information returned by those requests, you can easily download the content programmatically.
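As a rough illustration of the lookup described above, here is a minimal Python sketch. The helper names (availability_url, closest_snapshot, fetch_closest) are made up for this example, and the response layout (an "archived_snapshots" object containing a "closest" entry) is assumed from the public API documentation, so check it against a live response before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the example URLs in this answer.
AVAILABILITY_ENDPOINT = "http://archive.org/wayback/available"


def availability_url(site, timestamp):
    """Build the query URL for the snapshot closest to `timestamp` (e.g. YYYYMMDD)."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": site, "timestamp": timestamp})


def closest_snapshot(payload):
    """Return (timestamp, snapshot_url) from a decoded response, or None.

    Assumes the response looks like
    {"archived_snapshots": {"closest": {"available": ..., "timestamp": ..., "url": ...}}}.
    """
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


def fetch_closest(site, timestamp):
    """Perform the actual request (needs network access)."""
    with urlopen(availability_url(site, timestamp), timeout=30) as response:
        return closest_snapshot(json.load(response))
```

Because the API returns the nearest snapshot, the result may come from a different year than the one you asked for. For a per-year listing, compare the first four characters of the returned timestamp with the requested year before counting it.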

Wayback CDX Server API

Next we have the Wayback Machine CDX Server API, which exposes a much richer set of query options. Most notably, you can quickly list every capture of a URL that you are interested in:

http://web.archive.org/cdx/search/cdx?url=www.fredtrotter.com
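For the per-year use case in the question, a CDX query can be narrowed with a few extra parameters. The sketch below is only one possible approach: the function names are hypothetical, the parameters used (output=json, fl, filter=statuscode:200, collapse=timestamp:4) are taken from the CDX server documentation as I understand it, and the replay URL pattern http://web.archive.org/web/<timestamp>/<original> is assumed.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def cdx_url(site):
    """Build a CDX query asking for roughly one successful capture per year."""
    params = {
        "url": site,
        "output": "json",             # rows as JSON arrays, header row first
        "fl": "timestamp,original",   # only the fields needed here
        "filter": "statuscode:200",   # skip redirects and errors
        "collapse": "timestamp:4",    # collapse captures sharing the same YYYY
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def first_capture_per_year(rows):
    """Map year -> replay URL of the first capture seen for that year.

    `rows` is the decoded JSON: a header row followed by data rows.
    """
    if not rows:
        return {}
    header, *records = rows
    ts_index = header.index("timestamp")
    original_index = header.index("original")
    captures = {}
    for record in records:
        timestamp = record[ts_index]
        replay = f"http://web.archive.org/web/{timestamp}/{record[original_index]}"
        # setdefault keeps the earliest row for each year and ignores later ones.
        captures.setdefault(timestamp[:4], replay)
    return captures


def fetch_captures(site):
    """Run the query (needs network access)."""
    with urlopen(cdx_url(site), timeout=60) as response:
        return first_capture_per_year(json.load(response))
```

The grouping is done again on the client side so the result stays correct even if the server returns more than one row per year.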

Memento API

Lastly we have the deep and mysterious resource that is the Wayback Machine Memento API. That link is to a blog post about the functionality, but from what I can gather, this is about working with the Wayback Machine at a protocol level, where the Memento Protocol is a well-thought-out specification of how an archive site should operate.

Final thoughts

In all cases, please be gentle and respectful with your scripting. The Wayback Machine API does not currently require credentials, which is a very generous and open posture, in keeping with the Internet Archive's role as a "Wonder of the Virtual World". So do not abuse it; that is how we make sure we get to keep nice things.

Thanks to Greg, and the rest of the Wayback Machine team, for the excellent work you do to keep the Internet a source of personal freedom and expression.

ftrotter answered Sep 20 '22 08:09


Our CDX API allows you to make 2 separate calls to get a list of all captures: one for the url or domain example1.com and one for the url or domain example2.com. You can then produce whatever summary you like.
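The summary step could look something like the following Python sketch. It assumes you have already collected, per site, the set of years that have at least one capture (for example from the timestamp field of each CDX result); the function name sites_by_year is invented for illustration.

```python
def sites_by_year(years_per_site):
    """Invert {site: years with a capture} into {year: sorted list of sites}."""
    summary = {}
    for site, years in years_per_site.items():
        for year in years:
            summary.setdefault(year, []).append(site)
    # Sort years and site names so the output order is stable.
    return {year: sorted(sites) for year, sites in sorted(summary.items())}


# Hypothetical input mirroring the example in the question.
captured = {
    "example1.com": {"2010", "2011", "2013"},
    "example2.com": {"2010", "2012", "2013"},
}

for year, sites in sites_by_year(captured).items():
    print(f"{year}: {', '.join(sites)}")
```

One call per site keeps the load on the server low, and the merge happens entirely on your side.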

Greg Lindahl answered Sep 22 '22 08:09