CommonCrawl: How to find a specific web page?

I am using CommonCrawl to restore pages I should have archived but have not.

As I understand it, the Common Crawl Index offers access to all URLs stored by Common Crawl. Thus, it should tell me whether a URL has been archived.

A simple script downloads all indices from the available crawls:

./cdx-index-client.py -p 4 -c CC-MAIN-2016-18 *.thesun.co.uk --fl url -d CC-MAIN-2016-18
./cdx-index-client.py -p 4 -c CC-MAIN-2016-07 *.thesun.co.uk --fl url -d CC-MAIN-2016-07
... and so on

Afterwards I have 112 MB of data and simply grep:

grep "50569" * -r
grep "Locals-tell-of-terror-shock" * -r

The pages are not there. Am I missing something? The pages were published in 2006 and removed in June 2016, so I assume that CommonCrawl should have archived them?
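To check a single page, it may be simpler to ask the index server for that exact URL, one crawl at a time, instead of downloading the whole domain index and grepping it. The following is only a minimal sketch: it assumes the public index server at index.commoncrawl.org accepts `url` and `output=json` query parameters and answers with one JSON object per line, and the helper names `build_query`, `parse_records` and `lookup` are hypothetical.

```python
import json
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: base address of the public Common Crawl index server.
INDEX_SERVER = "https://index.commoncrawl.org"


def build_query(crawl_id, url):
    """Return the index query URL for one exact URL in one crawl."""
    params = urlencode({"url": url, "output": "json"})
    return f"{INDEX_SERVER}/{crawl_id}-index?{params}"


def parse_records(body):
    """Parse a newline-delimited JSON body, skipping blank lines."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def lookup(crawl_id, url):
    """Query one crawl for one URL; this call needs network access."""
    try:
        with urlopen(build_query(crawl_id, url)) as response:
            return parse_records(response.read().decode("utf-8"))
    except HTTPError as error:
        # Assumption: the server answers 404 when the URL has no captures.
        if error.code == 404:
            return []
        raise
```

Calling `lookup("CC-MAIN-2016-18", "http://www.thesun.co.uk/sol/homepage/news/50569/Locals-tell-of-terror-shock.html")` for each crawl ID would then return a list of captures, or an empty list if that crawl never fetched the page.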

Update: Thanks to Sebastian, only two links are left. The two URLs are:

http://www.thesun.co.uk/sol/homepage/news/50569/Locals-tell-of-terror-shock.html
http://www.thesun.co.uk/sol/homepage/news/54032/Sir-Ians-raid-apology.html

Common Crawl even offers a "URL Search Tool", but it responds with 502 Bad Gateway.

Does Common Crawl include images?

The data of interest include all images and videos from all web pages and metadata extracted from the surrounding HTML elements.

How often is Common Crawl updated?

Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public. Common Crawl's web archive consists of petabytes of data collected since 2011. It completes crawls generally every month.

How big is the Common Crawl dataset?

The CC-MAIN-2014-35 dataset is over 200 TB in size and contains approximately 2.8 billion web pages. The data is located in the commoncrawl bucket at /crawl-data/CC-MAIN-2014-35/.

You can use AWS Athena to query the Common Crawl index with SQL to find the URL, and then use the offset, length, and filename to read the content in your code. For details, see http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
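Reading the content from the offset, length, and filename can be sketched as a single HTTP range request followed by a gzip decompression. This is an illustration under stated assumptions, not a definitive implementation: it assumes the data files are served over HTTP at data.commoncrawl.org and that each WARC record is stored as its own gzip member; the helper names `range_header`, `decode_record` and `fetch_record` are hypothetical.

```python
import gzip
from urllib.request import Request, urlopen

# Assumption: public HTTP endpoint that serves the Common Crawl data files.
DATA_SERVER = "https://data.commoncrawl.org/"


def range_header(offset, length):
    """Build an HTTP Range value for one record (the end byte is inclusive)."""
    return f"bytes={offset}-{offset + length - 1}"


def decode_record(raw):
    """Decompress one gzip member and return it as text."""
    return gzip.decompress(raw).decode("utf-8", errors="replace")


def fetch_record(filename, offset, length):
    """Download and decompress one record; this call needs network access."""
    # Index results may carry offset and length as strings, hence int().
    headers = {"Range": range_header(int(offset), int(length))}
    request = Request(DATA_SERVER + filename, headers=headers)
    with urlopen(request) as response:
        return decode_record(response.read())
```

The returned text holds the WARC headers, the HTTP response headers, and the page body, so only the bytes of that one record are downloaded rather than the whole WARC file.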

Tags:

search-engine

common-crawl

Maximilian Böhm

Vikash Rathee
