Spark writing/reading to/from S3 - Partition Size and Compression

I am running an experiment to understand which file size performs best with S3 and EMR + Spark.

Input data :

Incompressible data: random bytes in files
Total data size: 20GB
Each folder holds a different input file size, ranging from 2MB to 4GB.

Cluster Specifications :

1 master + 4 nodes: c3.8xlarge
--driver-memory 5G \
--executor-memory 3G \
--executor-cores 2 \
--num-executors 60 \

Code :

scala> def time[R](block: => R): R = {
         val t0 = System.nanoTime()
         val result = block    // call-by-name
         val t1 = System.nanoTime()
         println("Elapsed time: " + (t1 - t0) + "ns")
         result
       }
time: [R](block: => R)R

scala> val inputFiles = time{sc.textFile("s3://bucket/folder/2mb-10240files-20gb/*/*")};
scala> val outputFiles = time {inputFiles.saveAsTextFile("s3://bucket/folder-out/2mb-10240files-20gb/")};

Observations

2MB - 32MB: Most of the time is spent in opening file handles [Not efficient]

64MB to 1GB: Spark launches 320 tasks for all of these file sizes, so the task count no longer follows the number of files in the folder holding the 20GB of data. For example, with 512MB files there were 40 files making up the 20GB, so 40 tasks could have sufficed, but instead there were 320 tasks, each dealing with 64MB of data.

4GB files: 0 bytes written [Not able to handle in memory / data not even splittable?]

Questions

Is there a default setting that forces the input to be processed in 64MB chunks?

Since the data I am using is random bytes and is already compressed, how is Spark splitting it further? If it can split this data, why is it not able to split a 4GB object?

Why does the compressed file size increase after writing via Spark? A 2MB compressed input file becomes 3.6MB in the output bucket.

asked Nov 21 '17 23:11

Palak Sukant

1 Answer

Since it is not specified, I'm assuming usage of gzip and Spark 2.2 in my answer.

Is there a default setting that forces the input to be processed in 64MB chunks?

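Yes, there is. Spark reads files through Hadoop's file system and input format APIs, and therefore treats S3 as a block-based file system even though it is an object store. The connector reports a nominal block size for each object, and splittable files are divided into input splits of about that size. So the real question here is: which S3 file system implementation are you using (s3a, s3n, etc.)? A similar question can be found here.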
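To make the 320-task figure from the question concrete, here is a rough sketch of the arithmetic. It mirrors the split-size rule of Hadoop's old-API FileInputFormat, which is what sc.textFile goes through: max(minSize, min(goalSize, blockSize)). The 64MB block size is an assumption taken from the observed task sizes, not a value confirmed for any particular connector, and the names SplitEstimate, splitSize and splitsPerFile are made up for this illustration.

```scala
object SplitEstimate {
  // Split-size rule of Hadoop's old-API FileInputFormat:
  // splitSize = max(minSize, min(goalSize, blockSize))
  def splitSize(goalSize: Long, minSize: Long, blockSize: Long): Long =
    math.max(minSize, math.min(goalSize, blockSize))

  // Number of splits for one splittable file.
  // Simplification: ignores the small "slop" tolerance Hadoop allows on the last split.
  def splitsPerFile(fileSize: Long, split: Long): Long =
    if (fileSize == 0L) 0L else (fileSize + split - 1) / split

  def main(args: Array[String]): Unit = {
    val MB = 1024L * 1024L
    val blockSize = 64 * MB        // ASSUMPTION: block size reported by the S3 connector
    val fileSize  = 512 * MB       // one of the cases from the question
    val numFiles  = 40
    val totalSize = fileSize * numFiles

    // textFile's default minPartitions is min(defaultParallelism, 2), so at most 2 here
    val goalSize = totalSize / 2

    val split = splitSize(goalSize, minSize = 1L, blockSize = blockSize)
    val tasks = numFiles * splitsPerFile(fileSize, split)
    println(s"split size = ${split / MB} MB, tasks = $tasks")
    // prints: split size = 64 MB, tasks = 320
  }
}
```

Under these assumptions every file size from 64MB to 1GB gives the same total, because 20GB divided into 64MB splits is always 320.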

Since the data I am using is random bytes and is already compressed, how is Spark splitting it further? If it can split this data, why is it not able to split a 4GB object?

Spark docs indicate that it is capable of reading compressed files:

All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").

This means that your files were read quite easily and converted to a plaintext string for each line.

However, you are using compressed files. Assuming it is a non-splittable format such as gzip, the entire file is needed for decompression, so each file has to be handled by a single task. You are running with 3GB executors, which can satisfy the needs of the 2MB to 1GB files quite well but can't handle a file larger than 3GB at once (probably less after accounting for overhead).
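As a small illustration of why a gzip file cannot be split, the sketch below uses only the JDK's java.util.zip classes (no Spark or Hadoop). It compresses some text and then tries to decompress it twice: once from the first byte and once from the middle of the stream. The object and method names are made up for this example, and the second attempt is expected to fail because a gzip stream can only be decoded from its header onwards.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}
import scala.util.Try

object GzipSplitDemo {
  // Compress a byte array into a single gzip stream
  def gzip(data: Array[Byte]): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val gz = new GZIPOutputStream(bos)
    gz.write(data)
    gz.close()
    bos.toByteArray
  }

  // Try to decompress the stream starting at the given byte offset;
  // returns false if the decoder throws at any point
  def canDecompressFrom(compressed: Array[Byte], offset: Int): Boolean =
    Try {
      val in = new GZIPInputStream(
        new ByteArrayInputStream(compressed, offset, compressed.length - offset))
      val buf = new Array[Byte](8192)
      while (in.read(buf) != -1) {}
      in.close()
    }.isSuccess

  def main(args: Array[String]): Unit = {
    val text = ("some line of text\n" * 100000).getBytes("UTF-8")
    val compressed = gzip(text)
    println(s"from start:  ${canDecompressFrom(compressed, 0)}")
    println(s"from middle: ${canDecompressFrom(compressed, compressed.length / 2)}")
  }
}
```

A reader that starts at an arbitrary offset has no header and no decoder state to work from, which is why a framework has to hand a whole .gz file to one task rather than one split per block.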

Some further info can be found in this question. Details of splittable compression types can be found in this answer.

Why does the compressed file size increase after writing via Spark? A 2MB compressed input file becomes 3.6MB in the output bucket.

As a corollary to the previous point, this means that Spark decompressed the data while reading it into the RDD as plaintext. When it is written back, it is no longer compressed. To compress the output, pass a compression codec as a second argument to saveAsTextFile, which is a method on the RDD rather than on the SparkContext:

inputFiles.saveAsTextFile("s3://path", classOf[org.apache.hadoop.io.compress.GzipCodec])

There are other compression formats available.

answered Sep 19 '22 01:09

Ra41P

Tags: amazon-web-services, gzip, amazon-s3, apache-spark
