Imagine I do some Spark operations on a file hosted in HDFS. Something like this:
val file = sc.textFile("hdfs://...")
val items = file.map(_.split('\t'))
...
Because in the Hadoop world the code should go where the data is, right?
So my question is: How do Spark workers know about HDFS DataNodes? How does Spark decide on which DataNodes to execute the code?
Spark uses the RDD's partitioner to determine which partition a particular record belongs to, and therefore on which worker it is stored. When Spark reads a file from HDFS, it creates a single partition for each input split. The input splits are determined by the Hadoop InputFormat used to read the file.
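For example (a minimal sketch assuming a spark-shell session where sc already exists; the HDFS path is a placeholder), you can observe the split-to-partition mapping directly:
val lines = sc.textFile("hdfs:///user/example/data.tsv")   // placeholder path
println(lines.getNumPartitions)   // typically one partition per HDFS block of the file
// minPartitions can only request more (smaller) splits, never fewer:
val finer = sc.textFile("hdfs:///user/example/data.tsv", minPartitions = 16)
println(finer.getNumPartitions)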
From day one, Spark was designed to read and write data from and to HDFS, as well as other storage systems, such as HBase and Amazon's S3. As such, Hadoop users can enrich their processing capabilities by combining Spark with Hadoop MapReduce, HBase, and other big data frameworks.
Using the InputFormat means Spark reuses the logic that determines where the input splits are located, and that information is used for scheduling. No, YARN is not required: each Spark worker knows on which node it is running, so the Spark master can select worker nodes based on data location (and available resources).
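To see the locality information the scheduler works with, you can ask an RDD for the preferred locations of each of its partitions (again a small sketch with a placeholder path, assuming an existing sc):
val rdd = sc.textFile("hdfs:///user/example/data.tsv")   // placeholder path
rdd.partitions.foreach { p =>
  // for an HDFS-backed RDD these are the hosts of the DataNodes holding the block
  println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
}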
Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat.
Spark reuses Hadoop classes: when you call textFile, it creates a TextInputFormat, which has a getSplits method (a split is roughly a partition or block), and each InputSplit has getLocations and getLocationInfo methods.
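Here is a rough sketch of what that looks like if you call those Hadoop classes yourself, using the older org.apache.hadoop.mapred API that sc.textFile relies on (the path is a placeholder):
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

val conf = new JobConf()
FileInputFormat.setInputPaths(conf, new Path("hdfs:///user/example/data.tsv"))  // placeholder path
val inputFormat = new TextInputFormat()
inputFormat.configure(conf)
val splits = inputFormat.getSplits(conf, 2)   // 2 is only a hint for the minimum number of splits
splits.foreach { split =>
  println(s"$split -> hosts: ${split.getLocations.mkString(", ")}")
}
Spark does essentially this when it builds the RDD, and feeds each split's locations to the scheduler as that partition's preferred locations.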