I need to browse and download a subset of Common Crawl's public data set. This page mentions where the data is hosted.
How can I browse and possibly download the Common Crawl data hosted at s3://aws-publicdatasets/common-crawl/crawl-002/ ?
To download a file over HTTP, prepend https://data.commoncrawl.org/ to the file path.
Public Data Sets on AWS provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications.
C4 (Colossal Clean Crawled Corpus) is a colossal, cleaned version of Common Crawl's web crawl corpus, based on the Common Crawl dataset (https://commoncrawl.org). It was used to train the T5 text-to-text Transformer models. The dataset can be downloaded in a pre-processed form from allennlp.
Just as an update, downloading the Common Crawl corpus has always been free, and you can use HTTP instead of S3. S3 allows you to use anonymous credentials to get access to the data.
If you want to download via HTTP, get one of the file locations, such as:
common-crawl/crawl-data/CC-MAIN-2014-23/segments/1404776400583.60/warc/CC-MAIN-20140707234000-00000-ip-10-180-212-248.ec2.internal.warc.gz
and then prepend https://commoncrawl.s3.amazonaws.com/ to it, resulting in the link:
https://commoncrawl.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2014-23/segments/1404776400583.60/warc/CC-MAIN-20140707234000-00000-ip-10-180-212-248.ec2.internal.warc.gz
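As a rough illustration of the step above, here is a minimal Python sketch that builds the download link from a file location. The function names `to_http_url` and `download` are made up for this example, the prefix is the one quoted in this answer (pass a different one if the host has changed), and the download helper is only defined, not called, so nothing is fetched unless you invoke it yourself.

```python
import shutil
import urllib.request

# Prefix quoted in the answer; treat it as an assumption that may need updating.
DEFAULT_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def to_http_url(file_path: str, prefix: str = DEFAULT_PREFIX) -> str:
    """Prepend the HTTP prefix to a file location, avoiding a doubled slash."""
    return prefix.rstrip("/") + "/" + file_path.lstrip("/")


def download(url: str, destination: str) -> None:
    """Stream the remote file to disk (hypothetical helper, plain urllib)."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)


# The file location from the answer above.
file_path = (
    "common-crawl/crawl-data/CC-MAIN-2014-23/segments/1404776400583.60/warc/"
    "CC-MAIN-20140707234000-00000-ip-10-180-212-248.ec2.internal.warc.gz"
)
url = to_http_url(file_path)
print(url)

# To actually fetch the file, uncomment the next line:
# download(url, "CC-MAIN-20140707234000-00000.warc.gz")
```

Individual WARC files are large, so it is worth checking the printed URL in a browser or with a HEAD request before starting a full download.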
To get a listing of all such files, refer to warc.paths.gz (or the equivalent for WET or WAT files) on the more recent crawls, or list the files with s3cmd or a similar tool using anonymous credentials.
This link will work and allow you to download the data without going through S3.
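For the file listing mentioned above, a sketch along these lines could turn a warc.paths.gz file into download links. It assumes the listing is a gzip-compressed text file with one file location per line. The helper names `read_paths` and `paths_to_urls` are hypothetical, and the demo builds a small in-memory sample instead of fetching a real listing; the second sample entry is invented for illustration.

```python
import gzip
import io

# Prefix quoted in the answer; an assumption that may need updating.
PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def read_paths(fileobj):
    """Read a gzip-compressed listing, returning non-empty lines as paths."""
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as lines:
        return [line.strip() for line in lines if line.strip()]


def paths_to_urls(paths, prefix=PREFIX):
    """Prepend the HTTP prefix to every path in the listing."""
    return [prefix.rstrip("/") + "/" + path.lstrip("/") for path in paths]


# In-memory stand-in for a downloaded warc.paths.gz.
# The first entry is the path from the answer; the second is made up.
sample_listing = (
    "common-crawl/crawl-data/CC-MAIN-2014-23/segments/1404776400583.60/warc/"
    "CC-MAIN-20140707234000-00000-ip-10-180-212-248.ec2.internal.warc.gz\n"
    "common-crawl/crawl-data/EXAMPLE/segments/0/warc/example-00001.warc.gz\n"
)
sample_file = io.BytesIO(gzip.compress(sample_listing.encode("utf-8")))

urls = paths_to_urls(read_paths(sample_file))
for u in urls:
    print(u)
```

With a real listing saved to disk, `read_paths(open("warc.paths.gz", "rb"))` would work the same way, and slicing the resulting list is an easy way to download only a subset.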