Access a Common Crawl AWS public dataset

Tags: amazon-web-services, amazon-s3, amazon-ec2, amazon, common-crawl

I need to browse and download a subset of Common Crawl's public dataset. This page mentions where the data is hosted.
How can I browse, and possibly download, the Common Crawl data hosted at s3://aws-publicdatasets/common-crawl/crawl-002/ ?

asked May 20 '13 12:05

gibraltar

1 Answer

As an update: downloading the Common Crawl corpus has always been free, and you can fetch it over HTTP instead of through S3. S3 also lets you access the data with anonymous credentials.

If you want to download via HTTP, get one of the file locations, such as:

common-crawl/crawl-data/CC-MAIN-2014-23/segments/1404776400583.60/warc/CC-MAIN-20140707234000-00000-ip-10-180-212-248.ec2.internal.warc.gz

and then prepend https://commoncrawl.s3.amazonaws.com/ to it, resulting in the link:

https://commoncrawl.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2014-23/segments/1404776400583.60/warc/CC-MAIN-20140707234000-00000-ip-10-180-212-248.ec2.internal.warc.gz
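
If you are scripting this, the same prefixing step can be sketched in Python. This is only an illustration: the helper name to_http_url is made up for this example, and the base URL and file location are the ones quoted above.

```python
# Base URL quoted in the answer above.
BASE_URL = "https://commoncrawl.s3.amazonaws.com/"


def to_http_url(file_location: str, base_url: str = BASE_URL) -> str:
    """Prepend the base URL to a file location (hypothetical helper)."""
    # Normalise slashes so exactly one separates the two parts.
    return base_url.rstrip("/") + "/" + file_location.lstrip("/")


# The example file location from the answer, split only to keep lines short.
example_location = (
    "common-crawl/crawl-data/CC-MAIN-2014-23/segments/1404776400583.60/warc/"
    "CC-MAIN-20140707234000-00000-ip-10-180-212-248.ec2.internal.warc.gz"
)
example_url = to_http_url(example_location)
```

The resulting example_url can then be passed to any HTTP client, for instance urllib.request.urlretrieve from the standard library.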

That link lets you download the file over plain HTTP, without needing an S3 client.

To get a listing of all such files, refer to warc.paths.gz (or the equivalent for WET or WAT files) on the more recent crawls, or list the files with anonymous credentials using s3cmd or a similar tool.
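
For the warc.paths.gz route, a rough Python sketch of turning a listing into download links might look like the following. It assumes the listing is a gzip-compressed text file with one file location per line; the function name urls_from_paths_gz is hypothetical, and the sample listing is built in memory rather than fetched from a real crawl.

```python
import gzip
from typing import List

# Base URL quoted in the answer above.
BASE_URL = "https://commoncrawl.s3.amazonaws.com/"


def urls_from_paths_gz(gz_bytes: bytes, base_url: str = BASE_URL) -> List[str]:
    """Decompress a paths listing and build one HTTP URL per non-empty line."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    prefix = base_url.rstrip("/") + "/"
    return [
        prefix + line.strip().lstrip("/")
        for line in text.splitlines()
        if line.strip()  # skip blank lines
    ]


# In-memory stand-in for a real listing, using the file location from the answer.
sample_listing = gzip.compress(
    b"common-crawl/crawl-data/CC-MAIN-2014-23/segments/1404776400583.60/warc/"
    b"CC-MAIN-20140707234000-00000-ip-10-180-212-248.ec2.internal.warc.gz\n"
)
sample_urls = urls_from_paths_gz(sample_listing)
```

With a real listing, you would pass in the bytes of the downloaded warc.paths.gz for the crawl you are interested in; its exact location depends on the crawl and is not given here.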

answered Sep 20 '22 23:09

Smerity
