We are running Spark (Java) in local mode on a single AWS EC2 instance, using "local[*]" as the master.
However, profiling with New Relic and a simple top shows that only one of the 16 CPU cores on our machine is ever in use, for three different Spark jobs we've written (we've also tried other AWS instance types, but still only one core is ever used).
Runtime.getRuntime().availableProcessors() reports 16 processors, and sparkContext.defaultParallelism() also reports 16.
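For reference, the context is created roughly like this (a minimal sketch, not our exact code; the class name is illustrative):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalModeCheck {
    public static void main(String[] args) {
        // "local[*]" should allocate one worker thread per available core
        SparkConf conf = new SparkConf()
                .setAppName("local-mode-check")
                .setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Both of these report 16 on our instance
        System.out.println("availableProcessors = "
                + Runtime.getRuntime().availableProcessors());
        System.out.println("defaultParallelism  = "
                + sc.defaultParallelism());

        sc.stop();
    }
}
```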
I've looked at various Stack Overflow questions about local mode issues, but none seems to have resolved this.
Any advice much appreciated.
Thanks
EDIT: Process (a condensed code sketch of the whole pipeline follows the list)
1) Use sqlContext to read gzipped CSV file 1 with com.databricks.spark.csv from disk (S3) into DataFrame DF1.
2) Use sqlContext to read gzipped CSV file 2 with com.databricks.spark.csv from disk (S3) into DataFrame DF2.
3) Call DF1.toJavaRDD().mapToPair() with a mapping function that returns a Tuple2 of (key, List of values), giving RDD1.
4) Call DF2.toJavaRDD().mapToPair() with a mapping function that returns a Tuple2 of (key, List of values), giving RDD2.
5) Call union() on the two RDDs.
6) Call reduceByKey() on the unioned RDD to "merge by key", so that each key appears in only one Tuple2 of (key, List of values) (the same key appears in both RDD1 and RDD2).
7) Call .values().map() with a mapping function that iterates over all items in the provided List and merges them as required, returning a List of the same or smaller length.
8) Call .flatMap() to get an RDD of DomainClass objects.
9) Use sqlContext to create a DataFrame of type DomainClass from the flat-mapped RDD.
10) Use DF.coalesce(1).write() to write the DF as gzipped CSV to S3.
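Condensed, the pipeline looks roughly like this (a sketch only: class names, paths, and the header option are illustrative, and the key-extraction/merge functions are stubbed out because they are domain specific):

```java
import java.io.Serializable;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

import scala.Tuple2;

public class MergeJob {

    // Stand-in for the real domain class; fields omitted.
    public static class DomainClass implements Serializable { }

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("merge-job").setMaster("local[*]"));
        SQLContext sqlContext = new SQLContext(sc);

        // Steps 1-2: read the two gzipped CSVs from S3 with spark-csv.
        DataFrame df1 = readCsv(sqlContext, "s3n://bucket/file1.csv.gz"); // hypothetical path
        DataFrame df2 = readCsv(sqlContext, "s3n://bucket/file2.csv.gz"); // hypothetical path

        // Steps 3-4: map each row to (key, List of items).
        JavaPairRDD<String, List<DomainClass>> rdd1 = df1.toJavaRDD().mapToPair(MergeJob::toKeyedList);
        JavaPairRDD<String, List<DomainClass>> rdd2 = df2.toJavaRDD().mapToPair(MergeJob::toKeyedList);

        // Steps 5-6: union, then merge the lists that share a key.
        JavaPairRDD<String, List<DomainClass>> merged =
                rdd1.union(rdd2).reduceByKey(MergeJob::concatenate);

        // Steps 7-8: collapse each list, then flatten into one RDD of DomainClass.
        JavaRDD<DomainClass> flat = merged.values()
                .map(MergeJob::mergeItems)
                .flatMap(items -> items);   // Spark 1.x flatMap expects an Iterable

        // Steps 9-10: back to a DataFrame, written as a single gzipped CSV to S3.
        DataFrame out = sqlContext.createDataFrame(flat, DomainClass.class);
        out.coalesce(1)
           .write()
           .format("com.databricks.spark.csv")
           .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
           .save("s3n://bucket/output");    // hypothetical path
    }

    private static DataFrame readCsv(SQLContext sqlContext, String path) {
        return sqlContext.read()
                .format("com.databricks.spark.csv")
                .option("header", "true")
                .load(path);
    }

    // The key extraction and merge logic are domain specific; stubs only.
    private static Tuple2<String, List<DomainClass>> toKeyedList(Row row) {
        throw new UnsupportedOperationException("domain-specific mapping omitted");
    }

    private static List<DomainClass> concatenate(List<DomainClass> a, List<DomainClass> b) {
        throw new UnsupportedOperationException("domain-specific merge omitted");
    }

    private static List<DomainClass> mergeItems(List<DomainClass> items) {
        throw new UnsupportedOperationException("domain-specific merge omitted");
    }
}
```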
Local mode, also known as Spark in-process, is the default mode of Spark. It does not require any resource manager and runs everything on the same machine, so you can simply download Spark and run it without installing a cluster manager.
It's generally much easier to test your code locally (on a smaller data set, one assumes) before deploying to a cluster, and Spark makes that easy.
The most common way to launch Spark applications on a cluster is the spark-submit shell command. With spark-submit, the application does not need to be configured specifically for each cluster, because the script talks to the different cluster managers through a single interface.
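As a sketch (not the asker's code): if the master is left out of the application itself, the same jar can be run locally for testing or submitted to a cluster, with spark-submit supplying the master.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SubmitFriendlyApp {
    public static void main(String[] args) {
        // No setMaster() here: the master comes from spark-submit
        // (e.g. --master local[*] for local testing, --master yarn on a cluster).
        SparkConf conf = new SparkConf().setAppName("submit-friendly-app");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... job logic ...
        sc.stop();
    }
}
```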
I think your problem is that your CSV files are gzipped. When Spark reads files, it loads them in parallel, but it can only do this if the file's codec is splittable. Plain (non-gzipped) text and Parquet are splittable, as is the bgzip codec used in genomics (my field). With gzip, each of your files ends up in a single partition.
Try decompressing the csv.gz files and running this again. I think you'll see much better results!
Edit: I replicated this behavior on my machine: using sc.textFile on a 3 GB gzipped text file produced an RDD with a single partition.
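A quick way to confirm this on your own files (a sketch; the paths are illustrative):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class PartitionCheck {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("partition-check").setMaster("local[*]"));

        // Gzip is not splittable, so the whole file lands in a single partition.
        System.out.println("gzipped:      "
                + sc.textFile("data/big.csv.gz").partitions().size());

        // The same data uncompressed is split into many partitions,
        // which is what lets local[*] keep all the cores busy.
        System.out.println("uncompressed: "
                + sc.textFile("data/big.csv").partitions().size());

        sc.stop();
    }
}
```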