How does Apache Spark know about HDFS data nodes?

Imagine I do some Spark operations on a file hosted in HDFS. Something like this:

val file = sc.textFile("hdfs://...")
val items = file.map(_.split('\t'))
...

In the Hadoop world, the code should go to where the data is, right?

So my question is: how do Spark workers know about the HDFS DataNodes? How does Spark know on which DataNodes to execute the code?

799

asked Feb 12 '15 15:02

Frizz

1 Answer

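Spark reuses Hadoop classes: when you call textFile, it creates a TextInputFormat, whose getSplits method returns the input splits (a split is roughly a partition or block). Each InputSplit in turn has getLocations and getLocationInfo methods, which report the hosts holding that split's data, and Spark uses those hosts as the preferred locations for the corresponding tasks.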
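One way to see this from the Spark side is to ask the RDD for the preferred locations of each partition. The following is only a minimal sketch: the object name and the app name are made up for illustration, the input path is taken from the first command-line argument, and the master is assumed to be supplied by spark-submit.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical object name, used for illustration only.
object PreferredLocationsSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: the master URL is provided externally (e.g. by spark-submit).
    val conf = new SparkConf().setAppName("preferred-locations-sketch")
    val sc = new SparkContext(conf)

    // Assumption: args(0) is the path of a text file, e.g. an hdfs:// URI.
    val file = sc.textFile(args(0))

    // Each partition of this RDD wraps a Hadoop input split. preferredLocations
    // returns the host names Spark would like to run that partition's task on.
    file.partitions.foreach { p =>
      val hosts = file.preferredLocations(p)
      println(s"partition ${p.index}: ${hosts.mkString(", ")}")
    }

    sc.stop()
  }
}
```

For a file on HDFS, the printed hosts should correspond to the DataNodes that store the blocks of each split. For a local file, the list is typically empty, because there is no locality information to report.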

155

answered Oct 07 '22 17:10

G Quintana

Tags: apache-spark, hadoop, hdfs
