Is Tachyon by default implemented by the RDDs in Apache Spark?

I'm trying to understand Spark's in-memory feature. In the process I came across Tachyon, which is basically an in-memory data layer that provides fault tolerance without replication by using lineage, and reduces re-computation by check-pointing data sets. Where I got confused is that all of these features also seem achievable through Spark's standard RDD system. So I wonder: do RDDs implement Tachyon behind the curtains to provide these features? If not, then what is the use of Tachyon, when all of its work can be done by standard RDDs? Or am I making some mistake in relating these two? A detailed explanation or a link to one would be a great help. Thank you.

Himanshu Mehra asked Apr 22 '15

People also ask

Which function creates an RDD using the SparkContext object?

Text file RDDs can be created using SparkContext's textFile method. This method takes a URI for the file (either a local path on the machine, or an hdfs://, s3n://, etc. URI) and reads it as a collection of lines. Here is an example invocation: JavaRDD<String> distFile = sc.textFile("data.txt");
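A minimal sketch of the same idea in Scala, Spark's native language; the application name and the path /tmp/data.txt are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    // local[*] runs Spark locally with one worker thread per core.
    val conf = new SparkConf().setAppName("TextFileDemo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // textFile accepts local paths as well as hdfs:// and s3n:// URIs;
    // "/tmp/data.txt" is just a placeholder for a real input file.
    val lines = sc.textFile("/tmp/data.txt")
    println(lines.count()) // number of lines in the file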

What are RDDs in Spark?

RDD has been the primary user-facing API in Spark since its inception. At the core, an RDD is an immutable distributed collection of elements of your data, partitioned across the nodes in your cluster, that can be operated on in parallel with a low-level API that offers transformations and actions.
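A small illustration of those two properties, immutability and partitioning (a sketch reusing the SparkContext sc from the example above; the numbers are arbitrary):

    // Distribute a local collection across 4 partitions.
    val nums = sc.parallelize(1 to 100, 4)

    // Transformations return a *new* RDD; nums itself is never mutated.
    val doubled = nums.map(_ * 2)

    println(nums.partitions.length) // 4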

What is action in Spark RDD?

An Apache Spark Resilient Distributed Dataset (RDD) action is defined as a Spark operation that returns a raw value. In other words, any RDD function that returns something other than RDD[T] is considered an action in Spark programming.
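For example, contrasting the two kinds of operations (a sketch, again reusing sc from above):

    val words = sc.parallelize(Seq("spark", "tachyon", "rdd"))

    // Transformation: returns another RDD and is evaluated lazily.
    val upper = words.map(_.toUpperCase)

    // Actions: return raw values to the driver and trigger execution.
    val n: Long = upper.count()
    val all: Array[String] = upper.collect()
    println(s"$n words: ${all.mkString(", ")}")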


1 Answer

What is in the paper you linked does not reflect the reality of Tachyon as a released open-source project; parts of that paper have only ever existed as research prototypes and have never been fully integrated into Spark/Tachyon.

When you persist data to the OFF_HEAP storage level via rdd.persist(StorageLevel.OFF_HEAP), Spark uses Tachyon to write that data into Tachyon's memory space as a file. This removes it from the Java heap, thus giving Spark more heap memory to work with.
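A minimal sketch of that workflow; the Tachyon master address and the spark.tachyonStore.url property reflect the Spark 1.x configuration and are assumptions here, not part of the answer:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    // Assumption: a Tachyon master is running at localhost:19998 and
    // Spark 1.x reads its address from spark.tachyonStore.url.
    val conf = new SparkConf()
      .setAppName("OffHeapDemo")
      .set("spark.tachyonStore.url", "tachyon://localhost:19998")
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(1 to 1000000)
    rdd.persist(StorageLevel.OFF_HEAP) // blocks are stored in Tachyon, off the JVM heap
    rdd.count()                        // materializes the RDD into Tachyon's memory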

It does not currently write the lineage information, so if your data is too large to fit into your configured Tachyon cluster's memory, portions of the RDD will be lost and your Spark jobs can fail.
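If that risk matters, one option (a sketch, not from the answer) is to stay with a standard storage level, where lineage is kept inside Spark and evicted partitions are simply recomputed:

    // Lost or evicted partitions are recomputed from lineage,
    // spilling to local disk when memory runs out.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)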

RobV answered Oct 20 '22