There is a column called "RDD Blocks" in the Executors tab of the Spark UI. For a particular streaming job that consumes messages from Kafka, the number of RDD blocks keeps increasing. After a long run with a large number of RDD blocks, some executors were removed automatically and the application slowed down. DStreams and RDDs are not persisted manually anywhere.
It would be a great help if someone could explain when these blocks are created and on what basis they are removed (are there any parameters that need to be modified?).
RDDs are immutable, i.e. you cannot change an RDD in place; you derive new RDDs from it by applying transformations.
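A minimal sketch, assuming a spark-shell session where `sc` is the SparkContext (the variable names are illustrative):

```scala
// Transformations never mutate an RDD; they return a new one.
val numbers = sc.parallelize(1 to 10)   // original RDD
val doubled = numbers.map(_ * 2)        // new RDD; `numbers` itself is unchanged
println(numbers.count())                // 10
println(doubled.sum())                  // 110.0
```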
RDDs are divided into smaller chunks called partitions, and when you execute an action, one task is launched per partition. So the more partitions an RDD has, the more parallelism you can get.
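A small sketch of that relationship, again assuming `sc` from a spark-shell session:

```scala
// One task per partition per action, so the partition count caps parallelism.
val rdd = sc.parallelize(1 to 1000, 8)   // explicitly request 8 partitions
println(rdd.getNumPartitions)            // 8 -> an action such as count() runs 8 tasks
val wider = rdd.repartition(16)          // shuffle into 16 partitions for more parallelism
println(wider.getNumPartitions)          // 16
```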
RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations.
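A hedged sketch of both creation paths and explicit persistence; the HDFS path and data are made up, and `sc` is again the spark-shell SparkContext:

```scala
import org.apache.spark.storage.StorageLevel

val fromFile       = sc.textFile("hdfs:///data/events.log")  // from a Hadoop-supported file system
val fromCollection = sc.parallelize(Seq("a", "b", "c"))      // from a Scala collection in the driver

val words = fromFile.flatMap(_.split("\\s+"))
words.persist(StorageLevel.MEMORY_ONLY)   // cached partitions appear as RDD blocks in the UI
println(words.count())                    // first action computes and caches the blocks
println(words.distinct().count())         // a second action reuses the cached blocks
```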
You can use the subtractByKey() function to remove the elements whose key is present in another RDD.
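For example (the data is made up):

```scala
val left  = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
val right = sc.parallelize(Seq(("b", 99)))

// Keep only the pairs from `left` whose key does not appear in `right`.
val remaining = left.subtractByKey(right)
remaining.collect().foreach(println)      // (a,1) and (c,3)
```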
A good explanation of the Spark UI is this. RDD blocks can represent cached RDD partitions, intermediate shuffle outputs, broadcast variables, etc. Check out the BlockManager section of that book.
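If the growing block count in the question comes from DStream-generated RDDs, there are documented settings that control how aggressively those blocks are cleaned up. A hedged sketch (the app name is hypothetical and these values are starting points, not a known fix):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kafka-streaming-blocks")               // hypothetical app name
  .set("spark.streaming.unpersist", "true")           // auto-unpersist DStream-generated RDDs (default: true)
  .set("spark.cleaner.periodicGC.interval", "15min")  // run the context cleaner's GC more often (default: 30min)
```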