Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

spark + hadoop data locality

I got an RDD of filenames, so an RDD[String]. I get that by parallelizing a list of filenames (of files inside hdfs).

Now I map this rdd and my code opens a hadoop stream using FileSystem.open(path). Then I process it.

When I run my task, I use spark UI/Stages and I see the "Locality Level" = "PROCESS_LOCAL" for all the tasks. I don't think spark could possibly achieve data locality the way I run the task (on a cluster of 4 data nodes), how is that possible?

like image 641
kostas.kougios Avatar asked Jun 23 '15 15:06

kostas.kougios


People also ask

What is data locality in spark?

Data locality in spark helps spark scheduler to run the tasks of compute or caching on the machines where the data is available. This concept came from Hadoop Map/Reduce where data in HDFS will be used to place map operation. This avoided the data movement over network in HDFS.

What is data locality in Hadoop?

In Hadoop, Data locality is the process of moving the computation close to where the actual data resides on the node, instead of moving large data to computation. This minimizes network congestion and increases the overall throughput of the system. This feature of Hadoop we will discuss in detail in this tutorial.

Which is the best data locality level in spark?

PROCESS_LOCAL data is in the same JVM as the running code. This is the best locality possible. NODE_LOCAL data is on the same node. Examples might be in HDFS on the same node, or in another executor on the same node.

Does HDFS promote data locality?

Hadoop increases a data locality in the Hadoop Distributed File System (HDFS) to improve the performance of the system. The network traffic among nodes in the big data system is reduced by increasing a data-local on the machine.


2 Answers

When FileSystem.open(path) gets executed in Spark tasks, File content will be loaded to local variable in same JVM process and prepares the RDD ( partition(s) ). so the data locality for that RDD is always PROCESS_LOCAL

-- vanekjar has already commented the on question


Additional information about data locality in Spark:

There are several levels of locality based on the data’s current location. In order from closest to farthest:

  • PROCESS_LOCAL data is in the same JVM as the running code. This is the best locality possible
  • NODE_LOCAL data is on the same node. Examples might be in HDFS on the same node, or in another executor on the same node. This is a little slower than PROCESS_LOCAL because the data has to travel between processes
  • NO_PREF data is accessed equally quickly from anywhere and has no locality preference
  • RACK_LOCAL data is on the same rack of servers. Data is on a different server on the same rack so needs to be sent over the network, typically through a single switch
  • ANY data is elsewhere on the network and not in the same rack

Spark prefers to schedule all tasks at the best locality level, but this is not always possible. In situations where there is no unprocessed data on any idle executor, Spark switches to lower locality levels.

like image 71
mrsrinivas Avatar answered Nov 15 '22 10:11

mrsrinivas


Data locality is one of the spark's functionality which increases its processing speed.Data locality section can be seen here in spark tuning guide to Data Locality.At start when you write sc.textFile("path") at this point the data locality level will be according to the path you specified but after that spark tries to make locality level to process_local in order to optimize speed of processing by starting process at the place where data is present(locally).

like image 41
kbt Avatar answered Nov 15 '22 09:11

kbt