There is a column called "RDD Blocks" in the Executors tab of the Spark UI. For a particular streaming job that consumes messages from Kafka, the number of RDD blocks keeps increasing. After a long run with a large number of RDD blocks, some executors were removed automatically and the application slowed down. DStreams and RDDs are not persisted manually anywhere.
It would be a great help if someone could explain when these blocks are created and on what basis they are removed (are there any parameters that need to be modified?).
RDDs are immutable, i.e. you cannot change an RDD in place; you derive new RDDs from it by applying transformations.
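A minimal sketch, assuming a spark-shell session where `sc` is the SparkContext (the variable names are illustrative):

```scala
// Transformations never mutate an RDD; they return a new one.
val numbers = sc.parallelize(1 to 10)   // original RDD
val doubled = numbers.map(_ * 2)        // new RDD; `numbers` itself is unchanged
println(numbers.count())                // 10
println(doubled.sum())                  // 110.0
```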
RDDs are divided into smaller chunks called partitions, and when you execute an action, one task is launched per partition. So the more partitions an RDD has, the more parallelism you can get.
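A small sketch of that relationship, again assuming `sc` from a spark-shell session:

```scala
// One task per partition per action, so the partition count caps parallelism.
val rdd = sc.parallelize(1 to 1000, 8)   // explicitly request 8 partitions
println(rdd.getNumPartitions)            // 8 -> an action such as count() runs 8 tasks
val wider = rdd.repartition(16)          // shuffle into 16 partitions for more parallelism
println(wider.getNumPartitions)          // 16
```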
RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations.
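A hedged sketch of both creation paths and explicit persistence; the HDFS path and data are made up, and `sc` is again the spark-shell SparkContext:

```scala
import org.apache.spark.storage.StorageLevel

val fromFile       = sc.textFile("hdfs:///data/events.log")  // from a Hadoop-supported file system
val fromCollection = sc.parallelize(Seq("a", "b", "c"))      // from a Scala collection in the driver

val words = fromFile.flatMap(_.split("\\s+"))
words.persist(StorageLevel.MEMORY_ONLY)   // cached partitions appear as RDD blocks in the UI
println(words.count())                    // first action computes and caches the blocks
println(words.distinct().count())         // a second action reuses the cached blocks
```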
You can use the subtractByKey() function to remove the elements whose key is present in another RDD.
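For example (the data is made up):

```scala
val left  = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
val right = sc.parallelize(Seq(("b", 99)))

// Keep only the pairs from `left` whose key does not appear in `right`.
val remaining = left.subtractByKey(right)
remaining.collect().foreach(println)      // (a,1) and (c,3)
```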
A good explanation of the Spark UI is this. RDD blocks can represent cached RDD partitions, intermediate shuffle outputs, broadcast variables, etc. Check out the BlockManager section of that book.
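If the growing block count in the question comes from DStream-generated RDDs, there are documented settings that control how aggressively those blocks are cleaned up. A hedged sketch (the app name is hypothetical and these values are starting points, not a known fix):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kafka-streaming-blocks")               // hypothetical app name
  .set("spark.streaming.unpersist", "true")           // auto-unpersist DStream-generated RDDs (default: true)
  .set("spark.cleaner.periodicGC.interval", "15min")  // run the context cleaner's GC more often (default: 30min)
```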