I have an RDD of filenames, i.e. an RDD[String]. I get it by parallelizing a list of filenames (files stored in HDFS).
I then map over this RDD, and my code opens a Hadoop stream for each file using FileSystem.open(path) and processes it.
When I run my job, I look at Spark UI / Stages and I see "Locality Level" = PROCESS_LOCAL for all the tasks. I don't see how Spark could possibly achieve data locality the way I run the job (on a cluster of 4 data nodes), so how is that possible?
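For reference, roughly what the code looks like (the file names and the line-count "processing" here are just placeholders for what I actually do):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// Parallelize a list of HDFS file names (placeholder paths)
val fileNames = Seq("hdfs:///data/file1.txt", "hdfs:///data/file2.txt")
val filesRdd = sc.parallelize(fileNames)

// Inside each task, open the file with a Hadoop stream and process it
val processed = filesRdd.map { name =>
  val path = new Path(name)
  val fs = FileSystem.get(path.toUri, new Configuration())
  val in = fs.open(path)
  try {
    // stands in for my actual processing logic
    Source.fromInputStream(in).getLines().size
  } finally {
    in.close()
  }
}
```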
Data locality in Spark helps the scheduler run compute or caching tasks on the machines where the data is available. The concept comes from Hadoop Map/Reduce, where the location of data in HDFS is used to place map tasks, avoiding data movement over the network.
In Hadoop, data locality means moving the computation close to the node where the actual data resides, instead of moving large amounts of data to the computation. This minimizes network congestion and increases the overall throughput of the system.
PROCESS_LOCAL data is in the same JVM as the running code. This is the best locality possible. NODE_LOCAL data is on the same node. Examples might be in HDFS on the same node, or in another executor on the same node.
Hadoop exploits data locality in the Hadoop Distributed File System (HDFS) to improve the performance of the system: network traffic between nodes is reduced by scheduling work on the machine that already holds the data locally.
When FileSystem.open(path) gets executed inside a Spark task, the file content is loaded into a local variable in the same JVM process and forms the RDD partition(s), so the data locality for that RDD is always PROCESS_LOCAL.
-- vanekjar has already commented on the question
Additional information about data locality in Spark:
There are several levels of locality based on the data’s current location. In order from closest to farthest: PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, and ANY.
Spark prefers to schedule all tasks at the best locality level, but this is not always possible. In situations where there is no unprocessed data on any idle executor, Spark switches to lower locality levels.
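How long Spark waits before falling back to a lower locality level is configurable. A minimal sketch, assuming the standard spark.locality.wait settings (the values below are purely illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only: how long the scheduler waits for a better locality
// level before falling back to the next one (the default is a few seconds).
val conf = new SparkConf()
  .setAppName("locality-example")
  .set("spark.locality.wait", "5s")       // wait up to 5s for a better level
  .set("spark.locality.wait.node", "3s")  // per-level override for NODE_LOCAL

val sc = new SparkContext(conf)
```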
Data locality is one of Spark's features that increases its processing speed; see the Data Locality section of the Spark tuning guide. When you write sc.textFile("path"), the locality level is initially determined by the path you specified, but after that Spark tries to schedule tasks at PROCESS_LOCAL level to optimize processing speed by starting the processing where the data is already present (locally). A sketch of the difference is shown below.
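For instance, a rough sketch (the paths are placeholders): when the RDD is created directly from an HDFS path, Spark can use the HDFS block locations as preferred locations for each partition, so tasks can be scheduled close to the data; when you parallelize a list of names and open the files yourself, the scheduler has no such location information.

```scala
// Reading through Spark's built-in input methods keeps the HDFS block
// location information, so tasks can be scheduled close to the blocks.
val lines = sc.textFile("hdfs:///data/file1.txt")

// wholeTextFiles reads many (small) files and also carries location info.
val files = sc.wholeTextFiles("hdfs:///data/*.txt")

println(lines.count())
```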