I have many gzipped files stored on S3, organized by project and by hour per day. The file paths follow this pattern:
s3://<bucket>/project1/20141201/logtype1/logtype1.0000.gz
s3://<bucket>/project1/20141201/logtype1/logtype1.0100.gz
....
s3://<bucket>/project1/20141201/logtype1/logtype1.2300.gz
Since the data should be analyzed on a daily basis, I have to download and decompress the files belonging to a specific day, then assemble the content as a single RDD.
There are probably several ways to do this, but I would like to know the best practice for Spark.
Thanks in advance.
Files compressed by gzip can be directly concatenated into larger gzipped files.
Snappy and GZip blocks are not splittable, but files with Snappy blocks inside a container file format such as SequenceFile or Avro can be split.
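To see the concatenation property in action, here is a small Python sketch (the part*.gz and combined.gz file names are just placeholders):
import gzip
import shutil

# Two gzip members concatenated byte-for-byte still form a valid gzip file.
with open("combined.gz", "wb") as out:
    for name in ["part1.gz", "part2.gz"]:  # placeholder input files
        with open(name, "rb") as member:
            shutil.copyfileobj(member, out)

# gunzip, Hadoop, and Python's gzip module all read the members back as one stream.
with gzip.open("combined.gz", "rt") as f:
    for line in f:
        print(line, end="")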
Even though Amazon S3 has most of the features of a full-fledged web server, it does not transparently support GZIP. In other words, you have to manually compress the files using GZIP and set the Content-Encoding header to gzip.
Spark supports text files, SequenceFiles, and any other Hadoop InputFormat. In Spark, support for gzip input files should work the same as it does in Hadoop.
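For example, in PySpark (assuming an existing SparkContext sc, as in the examples below, and with placeholder paths), a gzipped text file and a SequenceFile can both be read directly:
text_rdd = sc.textFile("s3://<bucket>/project1/20141201/logtype1/logtype1.0000.gz")  # decompressed transparently
seq_rdd = sc.sequenceFile("s3://<bucket>/project1/20141201/some-sequence-file")  # hypothetical SequenceFile path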
The underlying Hadoop API that Spark uses to access S3 allows you to specify input files using a glob expression.
From the Spark docs:
All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
So in your case you should be able to open all those files as a single RDD using something like this:
rdd = sc.textFile("s3://bucket/project1/20141201/logtype1/logtype1.*.gz")
Just for the record, you can also specify files using a comma-delimited list, and you can even mix that with the * and ? wildcards.
For example:
rdd = sc.textFile("s3://bucket/201412??/*/*.gz,s3://bucket/random-file.txt")
Briefly, what this does is:
- * matches all strings, so in this case all gz files in all folders under 201412?? will be loaded.
- ? matches a single character, so 201412?? will cover all days in December 2014 like 20141201, 20141202, and so forth.
- , lets you just load separate files at once into the same RDD, like the random-file.txt in this case (see the sketch after this list).
Some notes about the appropriate URL scheme for S3 paths:
- If you're running Spark on EMR, s3:// is the correct scheme.
- Otherwise, s3a:// is the way to go. s3n:// has been deprecated on the open source side in favor of s3a://; you should only use s3n:// if you're running Spark on Hadoop 2.6 or older.
Note: Under Spark 1.2, the proper format would be as follows:
val rdd = sc.textFile("s3n://<bucket>/<foo>/bar.*.gz")
That's s3n://, not s3://.
You'll also want to put your credentials in conf/spark-env.sh as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
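Alternatively, the same credentials can be set through the Hadoop configuration at runtime; a minimal PySpark sketch (the key values are placeholders, and the fs.s3n.* properties correspond to the s3n:// scheme used above):
# Placeholder credentials; prefer environment variables or IAM roles over hard-coding them.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID")
hadoop_conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")
rdd = sc.textFile("s3n://<bucket>/<foo>/bar.*.gz")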