When are Spark RDD blocks created and destroyed/removed?

There's a column called RDD Blocks in the Executors tab of the Spark UI. I have observed that the number of RDD blocks keeps increasing for a particular streaming job where messages are streamed from Kafka.

Certain executors were removed automatically, and the application slows down after a long run once a large number of RDD blocks has accumulated. DStreams and RDDs are not persisted manually anywhere.

It would be a great help if someone could explain when these blocks are created and on what basis they are removed (are there any parameters that need to be modified?).

nitin angadi asked Apr 12 '18 11:04

People also ask

Can data in an RDD be changed once the RDD is created?

RDDs are immutable in nature, i.e. we cannot change an RDD; instead, we derive new RDDs by applying transformations.
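For example, a minimal Scala sketch (names are illustrative; sc is an existing SparkContext): map does not modify the original RDD, it returns a new one.

    val numbers = sc.parallelize(Seq(1, 2, 3))  // original RDD
    val doubled = numbers.map(_ * 2)            // transformation returns a NEW RDD
    // `numbers` is unchanged; `doubled` is a separate, derived RDD
    doubled.collect()                           // Array(2, 4, 6)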

When an RDD is created, it is broken into smaller pieces that can be distributed over the cluster. What are these pieces called?

RDDs are divided into smaller chunks called partitions, and when you execute an action, a task is launched per partition. So the more partitions there are, the more parallelism you get.
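As a rough illustration in Scala (again assuming an existing SparkContext sc):

    // Request 4 partitions explicitly when creating the RDD
    val rdd = sc.parallelize(1 to 100, numSlices = 4)
    println(rdd.getNumPartitions)  // 4
    // An action such as count() launches one task per partition,
    // so up to 4 tasks can run in parallel here.
    rdd.count()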

How are RDDs created?

RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations.
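A minimal sketch of both creation paths, plus opting into persistence (the HDFS path is a placeholder):

    // From an existing Scala collection in the driver program:
    val fromCollection = sc.parallelize(Seq("a", "b", "c"))

    // From a file in a Hadoop-supported file system:
    val fromFile = sc.textFile("hdfs:///data/events.txt")

    // Optionally keep the RDD in memory for reuse across actions:
    fromFile.persist()  // defaults to MEMORY_ONLY for RDDs
    fromFile.count()    // first action computes and caches the partitions
    fromFile.count()    // subsequent actions read the cached blocks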

How do you remove elements from an RDD?

You can use the subtractByKey() function to remove the elements whose keys are present in another RDD.
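For example (a small Scala sketch with made-up data):

    val left  = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
    val right = sc.parallelize(Seq(("b", 99)))
    // Drops every element of `left` whose key also appears in `right`
    left.subtractByKey(right).collect()  // Array((a,1), (c,3))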


1 Answer

A good explanation of the Spark UI is this one. RDD blocks can represent cached RDD partitions, intermediate shuffle outputs, broadcasts, etc. Check out the BlockManager section of this book.
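If you want to see which cached RDDs are currently holding blocks, one option is a sketch like the following (Scala; getPersistentRDDs is a developer API on SparkContext, and the unpersist call is shown only as an example of explicitly releasing blocks):

    // List the RDDs this application currently has persisted:
    sc.getPersistentRDDs.foreach { case (id, rdd) =>
      println(s"RDD $id: storage level ${rdd.getStorageLevel}")
    }

    // Explicitly release the blocks of a cached RDD when it is
    // no longer needed (cachedRdd is hypothetical):
    // cachedRdd.unpersist()

For streaming jobs specifically, the spark.streaming.unpersist setting (true by default) controls whether Spark Streaming automatically unpersists the RDDs it generates once they are no longer needed.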

Eugene Lopatkin answered Oct 13 '22 13:10