Access a common crawl AWS public dataset

I need to browse and download a subset of Common Crawl's public data set. This page mentions where the data is hosted.
How can I browse, and ideally download, the Common Crawl data hosted at s3://aws-publicdatasets/common-crawl/crawl-002/?

asked May 20 '13 12:05 by gibraltar



1 Answer

Just as an update, downloading the Common Crawl corpus has always been free, and you can use plain HTTP instead of an S3 client. If you do use S3, anonymous credentials are enough to access the data.

If you want to download via HTTP, get one of the file locations, such as:

common-crawl/crawl-data/CC-MAIN-2014-23/segments/1404776400583.60/warc/CC-MAIN-20140707234000-00000-ip-10-180-212-248.ec2.internal.warc.gz

and then prepend https://commoncrawl.s3.amazonaws.com/ to it, resulting in the link:

https://commoncrawl.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2014-23/segments/1404776400583.60/warc/CC-MAIN-20140707234000-00000-ip-10-180-212-248.ec2.internal.warc.gz
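The prefixing step above can be sketched in Python. This is only a minimal illustration: the base URL is the one quoted in this answer, while the helper names `to_http_url` and `download` are made up for the example and are not part of any Common Crawl tooling.

```python
import shutil
import urllib.request

# Prefix quoted in the answer; treat it as an assumption if the hosting changes.
BASE_URL = "https://commoncrawl.s3.amazonaws.com/"


def to_http_url(path: str, base: str = BASE_URL) -> str:
    """Join the base URL and a file location with exactly one slash between them."""
    return base.rstrip("/") + "/" + path.lstrip("/")


def download(path: str, dest: str) -> None:
    """Stream one file to disk; the files are large, so avoid loading them into memory."""
    with urllib.request.urlopen(to_http_url(path)) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
```

Calling `to_http_url` on the file location shown above yields the same link as the one built by hand; `download` is a hypothetical convenience wrapper that saves the file locally.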

To get a listing of all such files, refer to warc.paths.gz (or the equivalent for WET or WAT files) in the more recent crawls, or list the bucket with anonymous credentials using s3cmd or a similar tool.

Any link built this way works over plain HTTPS, so you can download the data without an S3 client.
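As a rough sketch of how the warc.paths.gz listing mentioned above could be turned into download links: the function below assumes the listing is a gzip-compressed text file with one file location per line, and that you have already fetched its bytes (the answer does not give the listing's exact location, so none is filled in here). The name `urls_from_paths_listing` is hypothetical.

```python
import gzip
from typing import List

# Prefix quoted in the answer; treat it as an assumption if the hosting changes.
BASE_URL = "https://commoncrawl.s3.amazonaws.com/"


def urls_from_paths_listing(gz_bytes: bytes, base: str = BASE_URL) -> List[str]:
    """Decompress a paths listing and prefix every non-empty line with the base URL.

    Assumes one file location per line, which is not confirmed by the answer itself.
    """
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return [
        base.rstrip("/") + "/" + line.strip().lstrip("/")
        for line in text.splitlines()
        if line.strip()  # skip blank lines
    ]
```

The same function should work for the WET or WAT listings, provided they follow the same one-path-per-line layout.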

answered Sep 20 '22 23:09 by Smerity