I am using CommonCrawl to restore pages I should have achieved but have not.
In my understanding, the Common Crawl Index offers access to all URLs stored by Common Crawl. Thus, it should give me an answer if the URL is achieved.
A simple script downloads all indices from the available crawls:
./cdx-index-client.py -p 4 -c CC-MAIN-2016-18 *.thesun.co.uk --fl url -d CC-MAIN-2016-18
./cdx-index-client.py -p 4 -c CC-MAIN-2016-07 *.thesun.co.uk --fl url -d CC-MAIN-2016-07
... and so on
Afterwards I have 112mb of data and simply grep:
grep "50569" * -r
grep "Locals-tell-of-terror-shock" * -r
The pages are not there. Am I missing something? The page were published in 2006 and removed in June 2016. So I assume that CommonCrawl should have achieved them?
Update: Thanks to Sebastian, two links are left... Two URLs are:
They even proposed a "URL Search Tool" which answers with a 502 - Bad Gateway...
The data of interest include all images and videos from all web pages and metadata extracted from the surrounding HTML elements.
Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public. Common Crawl's web archive consists of petabytes of data collected since 2011. It completes crawls generally every month.
The new dataset is over 200TB in size containing approximately 2.8 billion webpages. The new data is located in the commoncrawl bucket at /crawl-data/CC-MAIN-2014-35/.
You can use AWS Athena to query Common crawl index like SQL to find the URL and then use the offset, length and filename to read the content in your code. See details here - http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With