 

Checkpointing In ALS Spark Scala

I just want to ask about the specifics of how to use checkpointInterval successfully in Spark. And what is meant by this comment in the ALS code: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala

If the checkpoint directory is not set in [[org.apache.spark.SparkContext]], this setting is ignored.

  1. How can we set the checkpoint directory? Can we use any HDFS-compatible directory for this?
  2. Is using setCheckpointInterval the correct way to implement checkpointing in ALS to avoid Stack Overflow errors?


asked Jan 06 '16 by Alger Remirata

People also ask

What is the use of checkpointing in Spark?

There are two types of data we checkpoint in Spark. Metadata checkpointing: metadata means data about the data; metadata checkpointing is used to recover the driver node of a streaming application from failure. It includes the configuration used to create the application, the DStream operations, and the incomplete batches. Data checkpointing: saving the generated RDDs to reliable storage, which is needed by stateful transformations that combine data across batches.
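A rough Scala sketch of how that metadata checkpointing is wired up in Spark Streaming; the checkpoint path, application name, socket source, and the createContext helper are all made up for illustration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoints"  // hypothetical path

// Builds a fresh context and enables checkpointing; only used when no
// checkpoint data exists yet.
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpoint-demo")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)                        // enables metadata (and data) checkpointing
  val lines = ssc.socketTextStream("localhost", 9999)  // hypothetical input source
  lines.count().print()                                // some output operation so the job has work to do
  ssc
}

// On restart the driver is rebuilt from the checkpointed metadata
// (configuration, DStream operations, incomplete batches).
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```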

What is RDD checkpointing in Spark?

RDD checkpointing is the process of truncating an RDD's lineage graph and saving it to a reliable distributed file system (such as HDFS) or a local file system. There are two types of checkpointing: reliable checkpointing, which saves the actual intermediate RDD data to a reliable distributed file system (e.g. Hadoop DFS), and local checkpointing, which truncates the lineage graph without writing the data to a reliable store.
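A minimal Scala sketch of reliable RDD checkpointing, assuming an existing SparkContext named sc; the HDFS path and the loop are illustrative only:

```scala
// Assumes an existing SparkContext `sc`.
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  // hypothetical path

var rdd = sc.parallelize(1 to 1000000)
for (_ <- 1 to 100) {
  rdd = rdd.map(_ + 1)   // the lineage graph grows with every transformation
}

rdd.persist()            // avoids recomputing the RDD when it is checkpointed
rdd.checkpoint()         // marks the RDD for checkpointing; nothing is written yet
rdd.count()              // the first action materializes the RDD, saves it to the
                         // checkpoint directory, and truncates the lineage
```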

What is local checkpointing?

A local checkpoint is a snapshot of the state of a process at a given instant. The usual assumptions are that a process stores all of its local checkpoints on stable storage and that it can roll back to any of its existing local checkpoints.

What is checkpointing in PySpark?

Checkpointing can be used to truncate the logical plan of a DataFrame, which is especially useful in iterative algorithms where the plan may grow exponentially. The data is saved to files inside the checkpoint directory set with SparkContext.setCheckpointDir(). New in version 2.1.
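The same idea in Scala, as a sketch assuming Spark 2.1+ and an existing SparkSession named spark; the checkpoint path and the loop are placeholders:

```scala
// Assumes an existing SparkSession `spark`.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  // hypothetical path

var df = spark.range(0, 1000000).toDF("id")
for (_ <- 1 to 50) {
  df = df.withColumn("id", df("id") + 1)  // the logical plan grows each iteration
}

// checkpoint() writes the data to the checkpoint directory and returns a
// DataFrame whose plan no longer carries the full transformation history.
val truncated = df.checkpoint()
truncated.count()
```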


1 Answer

How can we set the checkpoint directory? Can we use any HDFS-compatible directory for this?

You can use SparkContext.setCheckpointDir. As far as I remember, in local mode both local and DFS paths work just fine, but on a cluster the directory must be an HDFS path.
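For example (a sketch; the application name and both directory paths are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("als-checkpointing")
val sc = new SparkContext(conf)

// Local mode: a plain local directory is usually fine.
// sc.setCheckpointDir("/tmp/spark-checkpoints")

// Cluster mode: use a directory every executor can reach, typically on HDFS.
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")
```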

Is using setCheckpointInterval the correct way to implement checkpointing in ALS to avoid Stack Overflow errors?

It should help. See SPARK-1006.

PS: It seems that in order to actually perform checkpointing in ALS, the checkpoint directory must be set, or checkpointing won't be effective [Ref. here].
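Putting the two pieces together, here is a sketch using the RDD-based org.apache.spark.mllib.recommendation.ALS; the input path and the parameter values are made up, and an existing SparkContext named sc is assumed:

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// 1. Set the checkpoint directory first; otherwise checkpointInterval is ignored,
//    as the comment quoted in the question says.
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

// 2. Load ratings (hypothetical CSV of user,item,rating).
val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
  val Array(user, item, rating) = line.split(",")
  Rating(user.toInt, item.toInt, rating.toDouble)
}

// 3. Train with a checkpoint every 10 iterations to keep the lineage short
//    and avoid stack overflow errors on long runs.
val model = new ALS()
  .setRank(10)
  .setIterations(50)
  .setLambda(0.01)
  .setCheckpointInterval(10)
  .run(ratings)
```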

answered Sep 28 '22 by zero323