CommonCrawl: How to find a specific web page?

I am using CommonCrawl to restore pages I should have archived but have not.

As I understand it, the Common Crawl Index offers access to all URLs stored by Common Crawl, so it should tell me whether a given URL is archived.

A simple script downloads all indices from the available crawls:

./cdx-index-client.py -p 4 -c CC-MAIN-2016-18 *.thesun.co.uk --fl url -d CC-MAIN-2016-18
./cdx-index-client.py -p 4 -c CC-MAIN-2016-07 *.thesun.co.uk --fl url -d CC-MAIN-2016-07
... and so on

Afterwards I have 112 MB of data and simply grep:

grep "50569" * -r
grep "Locals-tell-of-terror-shock" * -r

The pages are not there. Am I missing something? The pages were published in 2006 and removed in June 2016, so I assume that CommonCrawl should have archived them?

Update: Thanks to Sebastian, two links are left. The two URLs are:

  • http://www.thesun.co.uk/sol/homepage/news/50569/Locals-tell-of-terror-shock.html
  • http://www.thesun.co.uk/sol/homepage/news/54032/Sir-Ians-raid-apology.html

They even proposed a "URL Search Tool" which answers with a 502 - Bad Gateway...
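
For checking a single exact URL rather than grepping a whole-domain dump, a minimal Python sketch against the index server's query API could look like the following. It assumes the public index server at index.commoncrawl.org and its `url` and `output=json` parameters; the helper names `index_query_url` and `parse_index_lines` are made up for illustration.

```python
import json
from urllib.parse import urlencode

# Assumption: public Common Crawl index server; adjust if the host differs.
INDEX_SERVER = "https://index.commoncrawl.org/"


def index_query_url(crawl, url):
    """Build the query URL asking one crawl's index about one page URL."""
    params = urlencode({"url": url, "output": "json"})
    return f"{INDEX_SERVER}{crawl}-index?{params}"


def parse_index_lines(body):
    """Parse a response body that holds one JSON object per line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]
```

Fetching the URL returned by `index_query_url("CC-MAIN-2016-18", "http://www.thesun.co.uk/sol/homepage/news/50569/Locals-tell-of-terror-shock.html")` and passing the body to `parse_index_lines` should give one record per capture. If the server reports no captures for the URL in that crawl, the page was most likely not crawled then.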

Maximilian Böhm asked Aug 10 '16 09:08




1 Answer

You can use AWS Athena to query the Common Crawl index with SQL to find the URL, and then use the offset, length and filename from the matching row to read the content in your code. See the details here: http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
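
The last step (reading the content from offset, length and filename) can be sketched in Python roughly as follows. This is only an illustration under some assumptions: the base URL `https://data.commoncrawl.org/` is assumed to be the public HTTP endpoint for the WARC files, the helper names are hypothetical, and the filename, offset and length are whatever the index returned for your URL.

```python
import gzip
import urllib.request

# Assumption: public HTTP endpoint for Common Crawl data; adjust if it differs.
BASE_URL = "https://data.commoncrawl.org/"


def byte_range(offset, length):
    """Return the HTTP Range header value covering exactly one WARC record."""
    return f"bytes={offset}-{offset + length - 1}"


def build_request(filename, offset, length):
    """Prepare a ranged GET request for one record inside a WARC file."""
    return urllib.request.Request(
        BASE_URL + filename,
        headers={"Range": byte_range(offset, length)},
    )


def fetch_record(filename, offset, length):
    """Download and decompress a single WARC record (does a network request)."""
    with urllib.request.urlopen(build_request(filename, offset, length)) as resp:
        compressed = resp.read()
    # Each record in a Common Crawl WARC file is stored as its own gzip member,
    # so the requested byte range can be decompressed on its own.
    return gzip.decompress(compressed)
```

`fetch_record` returns the raw record bytes: the WARC headers, then the HTTP response headers, then the page body. Index results often carry offset and length as strings, so convert them with `int()` before passing them in.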


Vikash Rathee answered Sep 30 '22 09:09
