When I run some of the Apache Spark examples in the Spark-Shell or as a job, I am not able to achieve full core utilization on a single machine. For example:
var textColumn = sc.textFile("/home/someuser/largefile.txt").cache()
var distinctWordCount = textColumn.flatMap(line => line.split('\0'))
    .map(word => (word, 1))
    .reduceByKey(_+_)
    .count()
When running this script, I mostly see only 1 or 2 active cores on my 8-core machine. Isn't Spark supposed to parallelise this?
Spark does not require users to have high-end, expensive systems with great computing power. It splits large datasets across the cores or machines available in the cluster and uses those computing resources to process the data in a distributed manner.
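The usual reason for low core usage in a case like yours is that the input RDD ends up with too few partitions: Spark runs one task per partition, so a file that splits into only one or two partitions keeps only one or two cores busy. As a minimal sketch, assuming an 8-core machine and reusing the file path from the question, you can ask for more partitions and verify the result:

// Sketch only: assumes 8 cores; sc.textFile accepts a minPartitions hint.
val text = sc.textFile("/home/someuser/largefile.txt", 8)  // hint: at least 8 partitions
println(text.getNumPartitions)                             // check how the file was split

// An RDD that is already loaded can also be redistributed explicitly.
val rebalanced = text.repartition(8)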
It's easy to run Spark locally on one machine; all you need is Java installed and on your system PATH, or the JAVA_HOME environment variable pointing to a Java installation. Spark runs on Java 8/11/17, Scala 2.12/2.13, Python 3.7+, and R 3.5+.
Spark Core is the base of the whole project. It provides distributed task dispatching, scheduling, and basic I/O functionality. Spark is built on a fundamental data structure known as the RDD (Resilient Distributed Dataset), a logical collection of data partitioned across the machines in a cluster.
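As a small illustration of that abstraction (the numbers below are made up, not taken from the question), an RDD created in the spark-shell can be split into an explicit number of partitions, and each partition is processed by its own task:

// Illustrative only: build an RDD with 8 partitions and process it in parallel.
val rdd = sc.parallelize(1 to 1000, 8)        // 8 partitions, one task per partition
println(rdd.getNumPartitions)                 // 8
val doubledSum = rdd.map(_ * 2).reduce(_ + _) // partitions are processed in parallel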
You can use local[*] to run Spark locally with as many worker threads as there are logical cores on your machine.
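For example, the spark-shell can be started with --master "local[*]", and a standalone application can set the same master when building its session. The builder below is only a sketch; the application name is a placeholder:

// Sketch of a standalone app using all local cores.
// (In the spark-shell, sc already exists and the master is set on the command line,
//  e.g. spark-shell --master "local[*]".)
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("LocalWordCount")   // placeholder application name
  .master("local[*]")          // one worker thread per logical core
  .getOrCreate()
val sc = spark.sparkContext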