I'm trying to understand Spark's in-memory feature. In the process I came across Tachyon, which is basically an in-memory data layer that provides fault tolerance without replication by using lineage, and reduces re-computation by check-pointing datasets. Where I got confused is that all these features also seem achievable with Spark's standard RDD system. So I wonder: do RDDs implement Tachyon behind the curtains to provide these features? If not, what is the use of Tachyon when all of its job can be done by standard RDDs? Or am I making some mistake in relating these two? A detailed explanation, or a link to one, would be a great help. Thank you.
Text file RDDs can be created using SparkContext's textFile method. This method takes a URI for the file (either a local path on the machine, or a hdfs://, s3n://, etc. URI) and reads it as a collection of lines. Here is an example invocation: JavaRDD<String> distFile = sc.textFile("data.txt");
The RDD has been the primary user-facing API in Spark since its inception. At its core, an RDD is an immutable distributed collection of elements of your data, partitioned across the nodes in your cluster, that can be operated on in parallel with a low-level API offering transformations and actions.
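As a minimal sketch of that low-level API (the list contents and the lambda here are invented for the example), a transformation like map produces a new RDD lazily without mutating the original:

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class RddTransformExample {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("RddTransformExample").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // An RDD is an immutable, partitioned collection: parallelize() splits
            // this local list into 2 partitions that can be processed in parallel.
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5), 2);

            // Transformations such as map() never mutate 'numbers'; they lazily
            // describe a new RDD derived from it.
            JavaRDD<Integer> doubled = numbers.map(x -> x * 2);

            // Nothing runs until an action asks for results.
            System.out.println(doubled.collect()); // [2, 4, 6, 8, 10]

            sc.stop();
        }
    }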
An Apache Spark Resilient Distributed Dataset (RDD) action is defined as a Spark operation that returns a raw value. In other words, any RDD function that returns something other than RDD[T] is considered an action in Spark programming.
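To make that concrete (a small sketch with invented data), count() and reduce() below are actions because they return plain values to the driver rather than another RDD[T]:

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class RddActionExample {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("RddActionExample").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4));

            // Actions return raw values, not RDDs:
            long howMany = numbers.count();            // a long: 4
            int sum = numbers.reduce((a, b) -> a + b); // a plain Integer: 10

            System.out.println(howMany + " elements, sum = " + sum);

            sc.stop();
        }
    }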
What is in the paper you linked does not reflect the reality of what is in Tachyon as a released open-source project; parts of that paper have only ever existed as research prototypes and have never been fully integrated into Spark/Tachyon.
When you persist data at the OFF_HEAP storage level via rdd.persist(StorageLevel.OFF_HEAP), Spark uses Tachyon to write that data into Tachyon's memory space as a file. This removes it from the Java heap, giving Spark more heap memory to work with.
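A rough sketch of what that looks like (Spark 1.x era; the Tachyon master address, the input path, and the exact property name are assumptions here: older releases used spark.tachyonStore.url, while Spark 1.6 renamed it to spark.externalBlockStore.url, so check the docs for your version):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.storage.StorageLevel;

    public class OffHeapPersistExample {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("OffHeapPersistExample")
                    // Placeholder Tachyon master URL; the property name depends on
                    // your Spark version (spark.tachyonStore.url vs.
                    // spark.externalBlockStore.url).
                    .set("spark.externalBlockStore.url", "tachyon://localhost:19998");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Placeholder input path.
            JavaRDD<String> lines = sc.textFile("hdfs:///path/to/input");

            // OFF_HEAP keeps the cached blocks outside the JVM heap; in Spark 1.x
            // that means writing them into Tachyon's memory space as files,
            // freeing heap memory for Spark itself.
            lines.persist(StorageLevel.OFF_HEAP());

            System.out.println(lines.count());

            sc.stop();
        }
    }

Whether this actually helps depends on Tachyon having enough memory configured for the cached data, which is exactly the caveat below.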
It does not currently write the lineage information, so if your data is too large to fit into your configured Tachyon cluster's memory, portions of the RDD will be lost and your Spark jobs can fail.