
How does Apache Spark know about HDFS data nodes?

Imagine I do some Spark operations on a file hosted in HDFS. Something like this:

val file = sc.textFile("hdfs://...")
val items = file.map(_.split('\t'))
...

Because in the Hadoop world the code should go where the data is, right?

So my question is: How do Spark workers know of HDFS data nodes? How does Spark know on which Data Nodes to execute the code?

Frizz asked Feb 12 '15 15:02


People also ask

How does Spark read data from HDFS?

Spark uses an RDD's partitioner property to determine how records are assigned to partitions, and thus on which worker a particular record is stored. When Spark reads a file from HDFS, it creates one partition per input split. The input splits are determined by the Hadoop InputFormat used to read the file.

How does Spark interact with HDFS?

From day one, Spark was designed to read and write data from and to HDFS, as well as other storage systems, such as HBase and Amazon's S3. As such, Hadoop users can enrich their processing capabilities by combining Spark with Hadoop MapReduce, HBase, and other big data frameworks.

How does Spark know where the data is?

Using an InputFormat means Spark reuses the Hadoop logic that determines where input splits are located. This location information is used for scheduling. Using YARN is not required: each Spark worker knows which node it is running on, so the Spark master can select worker nodes based on data location (and available resources).
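
To see the location hints the scheduler works with, something like the following sketch can help. It assumes Spark and the Hadoop client libraries are on the classpath and that the job is launched with spark-submit (so the master is supplied externally). The object name, the helper locationsByPartition, and the use of args(0) as the input path are made up for illustration. It reads the file with sc.hadoopFile rather than sc.textFile because, as far as I can tell, the mapped RDD returned by textFile does not report locations itself; the scheduler follows the dependency back to the underlying Hadoop RDD.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of Spark.
object PreferredLocationsSketch {

  // Pair each partition index with the hosts the RDD reports as preferred for it.
  def locationsByPartition(rdd: RDD[_]): Seq[(Int, Seq[String])] =
    rdd.partitions.toSeq.map(p => (p.index, rdd.preferredLocations(p)))

  def main(args: Array[String]): Unit = {
    // Assumption: args(0) is the input path, for example an hdfs:// URI.
    val path = args(0)
    val sc = new SparkContext(new SparkConf().setAppName("preferred-locations-sketch"))
    try {
      // Same input format that textFile uses, but without the trailing map step.
      val raw = sc.hadoopFile[LongWritable, Text, TextInputFormat](path)
      locationsByPartition(raw).foreach { case (index, hosts) =>
        println(s"partition $index -> ${hosts.mkString(", ")}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

On an HDFS-backed file the printed hosts should correspond to data nodes holding the blocks of each split; for a local file the list may well be empty.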

Does Apache Spark use HDFS?

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat.


1 Answer

Spark reuses Hadoop classes: when you call textFile, it creates a TextInputFormat, which has a getSplits method (a split is roughly a partition or block), and each InputSplit in turn has getLocations and getLocationInfo methods that report the hosts holding that split's data.
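
The same Hadoop calls can be tried outside Spark. This is only a minimal sketch, assuming the Hadoop client libraries (the old org.apache.hadoop.mapred API) are on the classpath; the object name SplitLocations and the helper splitLocations are hypothetical, and the numSplits value of 1 is just a hint passed to getSplits, not something Spark is claimed to use.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

// Hypothetical helper object, not part of Hadoop or Spark.
object SplitLocations {

  // For each input split of the file, return a description of the split
  // and the host names that getLocations reports for it.
  def splitLocations(path: String): Seq[(String, Seq[String])] = {
    val jobConf = new JobConf()
    FileInputFormat.setInputPaths(jobConf, new Path(path))

    val format = new TextInputFormat()
    format.configure(jobConf)

    // The second argument is only a hint for the desired number of splits.
    val splits = format.getSplits(jobConf, 1)
    splits.toSeq.map(split => (split.toString, split.getLocations.toSeq))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: args(0) is the input path, for example an hdfs:// URI.
    splitLocations(args(0)).foreach { case (split, hosts) =>
      println(s"$split -> ${hosts.mkString(", ")}")
    }
  }
}
```

For a file in HDFS, each line of output should pair one split with the data nodes that store its block, which is the information Spark can then use as the preferred locations of the corresponding partition.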

G Quintana answered Oct 07 '22 17:10