I run PySpark on an 8-node Google Dataproc cluster with default settings.
A few seconds after starting, I see 30 executor cores running (as expected):
>>> sc.defaultParallelism
30
One minute later:
>>> sc.defaultParallelism
2
From that point on, all actions run on only 2 cores:
>>> rng = sc.parallelize(range(1,1000000))
>>> rng.cache()
>>> rng.count()
>>> rng.getNumPartitions()
2
If I run rng.cache() while the cores are still connected, they stay connected and jobs get distributed.
Checking the monitoring app (port 4040 on the master node) shows that the executors were removed:
Executor 1
Removed at 2016/02/25 16:20:14
Reason: Container container_1456414665542_0006_01_000002 exited from explicit termination request.
Is there a setting that could keep the cores connected without workarounds?
For the most part, what you are seeing is actually just the difference in how Spark on YARN can be configured versus Spark standalone. At the moment, YARN's reporting of "VCores Used" doesn't actually correspond to a real container reservation of cores; containers are actually based only on the memory reservation.
Overall there are a few things at play here:
Dynamic allocation causes Spark to relinquish idle executors back to YARN, and unfortunately at the moment Spark prints that spammy but harmless "lost executor" message. This was the classical problem of Spark on YARN, where Spark originally paralyzed the clusters it ran on because it would grab the maximum number of containers it thought it needed and then never give them up.
With dynamic allocation, when you start a long job, Spark quickly allocates new containers (with something like an exponential ramp-up so it can fill a full YARN cluster within a couple of minutes), and when idle it relinquishes executors with the same ramp-down, at an interval of about 60 seconds (if idle for 60 seconds, relinquish some executors).
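If you would rather keep dynamic allocation and just hold on to idle executors longer, that 60-second ramp-down is governed by the spark.dynamicAllocation.executorIdleTimeout property (you can also pass it with --conf or --properties, as in the examples below). A minimal sketch, assuming you build the context yourself in a script rather than using the one the pyspark shell provides; the 600s value is an arbitrary illustration, not a recommendation:

from pyspark import SparkConf, SparkContext

# Keep dynamic allocation, but let executors sit idle for 10 minutes
# before they are relinquished (illustrative value, default is 60s).
conf = SparkConf().set("spark.dynamicAllocation.executorIdleTimeout", "600s")
sc = SparkContext(conf=conf)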
If you want to disable dynamic allocation you can run:
spark-shell --conf spark.dynamicAllocation.enabled=false
gcloud dataproc jobs submit spark --properties spark.dynamicAllocation.enabled=false --cluster <your-cluster> foo.jar
Alternatively, if you specify a fixed number of executors, it should also automatically disable dynamic allocation:
spark-shell --conf spark.executor.instances=123
gcloud dataproc jobs submit spark --properties spark.executor.instances=123 --cluster <your-cluster> foo.jar
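The same properties apply to the pyspark shell and to PySpark scripts. If you construct the context yourself in a script, a minimal sketch would look like the following; the instance count of 8 is an arbitrary illustration, so size it to your cluster:

from pyspark import SparkConf, SparkContext

# Disable dynamic allocation and pin a fixed number of executors
# (illustrative value).
conf = (SparkConf()
        .set("spark.dynamicAllocation.enabled", "false")
        .set("spark.executor.instances", "8"))
sc = SparkContext(conf=conf)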