I have an RDD of filenames, i.e. an RDD[String]. I get it by parallelizing a list of filenames (files stored in HDFS).
I then map over this RDD, and my code opens a Hadoop stream for each file using FileSystem.open(path) and processes it.
When I run my job, I look at Spark UI / Stages and I see "Locality Level" = PROCESS_LOCAL for all the tasks. I don't see how Spark could possibly achieve data locality the way I run the job (on a cluster of 4 data nodes), so how is that possible?
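For reference, roughly what the code looks like (the file names and the line-count "processing" here are just placeholders for what I actually do):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// Parallelize a list of HDFS file names (placeholder paths)
val fileNames = Seq("hdfs:///data/file1.txt", "hdfs:///data/file2.txt")
val filesRdd = sc.parallelize(fileNames)

// Inside each task, open the file with a Hadoop stream and process it
val processed = filesRdd.map { name =>
  val path = new Path(name)
  val fs = FileSystem.get(path.toUri, new Configuration())
  val in = fs.open(path)
  try {
    // stands in for my actual processing logic
    Source.fromInputStream(in).getLines().size
  } finally {
    in.close()
  }
}
```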
Data locality in Spark helps the scheduler run compute or caching tasks on the machines where the data is available. The concept comes from Hadoop Map/Reduce, where the location of data in HDFS is used to place map tasks, avoiding data movement over the network.
In Hadoop, data locality means moving the computation close to the node where the actual data resides, instead of moving large amounts of data to the computation. This minimizes network congestion and increases the overall throughput of the system.
PROCESS_LOCAL data is in the same JVM as the running code. This is the best locality possible. NODE_LOCAL data is on the same node. Examples might be in HDFS on the same node, or in another executor on the same node.
Hadoop exploits data locality in the Hadoop Distributed File System (HDFS) to improve the performance of the system: network traffic between nodes is reduced by scheduling work on the machine that already holds the data locally.
When FileSystem.open(path) gets executed inside a Spark task, the file content is loaded into a local variable in the same JVM process and forms the RDD partition(s), so the data locality for that RDD is always PROCESS_LOCAL.
-- vanekjar has already commented on the question
Additional information about data locality in Spark:
There are several levels of locality based on the data’s current location. In order from closest to farthest: PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, and ANY.
Spark prefers to schedule all tasks at the best locality level, but this is not always possible. In situations where there is no unprocessed data on any idle executor, Spark switches to lower locality levels.
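How long Spark waits before falling back to a lower locality level is configurable. A minimal sketch, assuming the standard spark.locality.wait settings (the values below are purely illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only: how long the scheduler waits for a better locality
// level before falling back to the next one (the default is a few seconds).
val conf = new SparkConf()
  .setAppName("locality-example")
  .set("spark.locality.wait", "5s")       // wait up to 5s for a better level
  .set("spark.locality.wait.node", "3s")  // per-level override for NODE_LOCAL

val sc = new SparkContext(conf)
```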
Data locality is one of Spark's features that increases its processing speed; see the Data Locality section of the Spark tuning guide. When you write sc.textFile("path"), the locality level is initially determined by the path you specified, but after that Spark tries to schedule tasks at PROCESS_LOCAL level to optimize processing speed by starting the processing where the data is already present (locally). A sketch of the difference is shown below.
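For instance, a rough sketch (the paths are placeholders): when the RDD is created directly from an HDFS path, Spark can use the HDFS block locations as preferred locations for each partition, so tasks can be scheduled close to the data; when you parallelize a list of names and open the files yourself, the scheduler has no such location information.

```scala
// Reading through Spark's built-in input methods keeps the HDFS block
// location information, so tasks can be scheduled close to the blocks.
val lines = sc.textFile("hdfs:///data/file1.txt")

// wholeTextFiles reads many (small) files and also carries location info.
val files = sc.wholeTextFiles("hdfs:///data/*.txt")

println(lines.count())
```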