 

How to fully utilize all Spark nodes in cluster?

I have launched a 10-node cluster with the ec2 script in standalone mode for Spark. I am accessing data in S3 buckets from within the PySpark shell, but when I perform transformations on the RDD, only one node is ever used. For example, the code below reads in data from the Common Crawl corpus:

bucket = ("s3n://@aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-23/"
          "/segments/1404776400583.60/warc/CC-MAIN-20140707234000-00000-ip-10"
          "-180-212-248.ec2.internal.warc.gz")

data = sc.textFile(bucket)
data.count()

When I run this, only one of my 10 slaves processes the data. I know this because only one slave (213) has any logs of the activity when viewed from the Spark web console. When I view the activity in Ganglia, this same node (213) is the only slave with a spike in memory usage while the job was running.

Furthermore, I get exactly the same performance when I run the same script on an ec2 cluster with only one slave. I am using Spark 1.1.0, and any help or advice is greatly appreciated.

asked Dec 17 '14 by Michael David Watson

People also ask

Do you need to install Spark on all nodes of the YARN cluster?

If you use YARN as the cluster manager on a cluster with multiple nodes, you do not need to install Spark on each node. YARN distributes the Spark binaries to the nodes when a job is submitted. Running Spark on YARN requires a binary distribution of Spark that is built with YARN support.
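As a minimal sketch of what that looks like in practice (assuming a YARN-enabled Spark 1.x build and HADOOP_CONF_DIR pointing at your cluster configuration; the app name is a placeholder):

from pyspark import SparkConf, SparkContext

# Connect to YARN in client mode; YARN ships the Spark binaries and your
# application code to the worker nodes, so nothing is pre-installed on them.
conf = SparkConf().setMaster("yarn-client").setAppName("yarn-example")
sc = SparkContext(conf=conf)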

How many nodes are in a Spark cluster?

As an example: 11 nodes (1 master node and 10 worker nodes), 66 cores (6 cores per node), and 110 GB of RAM (10 GB per node).

How does Apache spark run on a cluster?

Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
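A minimal PySpark sketch of that life cycle (the master URL and app name are placeholders for illustration):

from pyspark import SparkConf, SparkContext

# The driver connects to the standalone cluster manager...
conf = (SparkConf()
        .setMaster("spark://<master-hostname>:7077")
        .setAppName("cluster-demo"))
sc = SparkContext(conf=conf)  # ...and acquires executors on the workers.

# The driver ships this computation to the executors as tasks.
print(sc.parallelize(range(1000), 10).sum())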


1 Answer

...ec2.internal.warc.gz

I think you've hit a fairly typical problem with gzipped files in that they cannot be loaded in parallel. More specifically, a single gzipped file cannot be loaded in parallel by multiple tasks, so Spark will load it with 1 task and thus give you an RDD with 1 partition.

(Note, however, that Spark can load 10 gzipped files in parallel just fine; it's just that each of those 10 files can only be loaded by 1 task. You can still get parallelism across files, just not within a file.)
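As a sketch of getting parallelism across files, you could point textFile at a glob instead of a single file (the pattern below is an assumption; adjust it to the segments you actually want):

# Each matched .warc.gz file becomes (at least) one partition, so ten
# files can be read by up to ten tasks at once.
many = sc.textFile("s3n://@aws-publicdatasets/common-crawl/crawl-data/"
                   "CC-MAIN-2014-23/segments/1404776400583.60/warc/*.warc.gz")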

You can confirm that you only have 1 partition by checking the number of partitions in your RDD explicitly:

data.getNumPartitions()
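With a single gzipped file, you would expect output like this (illustrative):

>>> data.getNumPartitions()
1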

The upper bound on the number of tasks that can run in parallel on an RDD is the number of partitions in the RDD or the number of slave cores in your cluster, whichever is lower.
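In other words (with hypothetical numbers for the cluster size):

num_partitions = 1          # one gzipped file -> an RDD with 1 partition
total_slave_cores = 10 * 4  # hypothetical: 10 slaves with 4 cores each
max_parallel_tasks = min(num_partitions, total_slave_cores)  # -> 1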

In your case, it's the number of RDD partitions. You can increase that by repartitioning your RDD as follows:

data = sc.textFile(bucket).repartition(sc.defaultParallelism * 3)

Why sc.defaultParallelism * 3?

The Spark Tuning guide recommends having 2-3 tasks per core, and sc.defaultParallelism gives you the number of cores in your cluster.
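As a hypothetical worked example, if each of your 10 slaves had 4 cores:

>>> sc.defaultParallelism        # 10 slaves x 4 cores each
40
>>> sc.defaultParallelism * 3    # 2-3 tasks per core, per the tuning guide
120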

answered Sep 18 '22 by Nick Chammas