
Apache Spark Checkpoint Directory is not set

While using Apache Spark, I tried to apply the reduceByKeyAndWindow() transformation to some streaming data and got the following error:

pyspark.sql.utils.IllegalArgumentException: requirement failed: The checkpoint directory has not been set. Please set it by StreamingContext.checkpoint().

Is it necessary to set a checkpoint directory?

If yes, what is the easiest way to set one up?

Sachin asked Feb 09 '23 13:02



1 Answer

Yes, it is necessary. Checkpointing must be enabled for applications with any of the following requirements:

Usage of stateful transformations - If either updateStateByKey or reduceByKeyAndWindow (with inverse function) is used in the application, then the checkpoint directory must be provided to allow for periodic RDD checkpointing.

Recovering from failures of the driver running the application - Metadata checkpoints are used to recover with progress information.

You can set up the checkpoint directory by calling ssc.checkpoint(checkpointDirectoryLocation) on your StreamingContext (here named ssc), which is the method the error message refers to.
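As a minimal PySpark sketch of the setup, assuming a socket text source: the function name build_windowed_counts, the host and port defaults, the 10-second batch interval, and the window sizes below are all illustrative choices, not anything from the question. The checkpoint path is whatever you pass in; a local directory is fine for testing, while on a cluster it should point at fault-tolerant storage such as HDFS.

```python
def add(a, b):
    """Reduce function: fold a value entering the window into the total."""
    return a + b


def subtract(a, b):
    """Inverse reduce function: remove a value leaving the window."""
    return a - b


def build_windowed_counts(checkpoint_dir, host="localhost", port=9999):
    """Return a StreamingContext that prints windowed word counts.

    host and port are placeholder values for a socket text source.
    """
    # Imported lazily so the two helpers above stay usable without Spark.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="CheckpointSketch")
    ssc = StreamingContext(sc, 10)  # 10-second batch interval

    # Without this call, reduceByKeyAndWindow with an inverse function
    # fails with "The checkpoint directory has not been set".
    ssc.checkpoint(checkpoint_dir)

    pairs = (ssc.socketTextStream(host, port)
             .flatMap(lambda line: line.split())
             .map(lambda word: (word, 1)))

    # 30-second window sliding every 10 seconds; both durations must be
    # multiples of the batch interval.
    counts = pairs.reduceByKeyAndWindow(add, subtract, 30, 10)
    counts.pprint()
    return ssc
```

To run it, call ssc = build_windowed_counts("/tmp/spark-checkpoint") (any writable path), then ssc.start() and ssc.awaitTermination().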

http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
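As background on why the inverse-function form counts as stateful: each window's result is derived from the previous window's result by adding the batch that enters and removing the batch that leaves, so every result depends on the one before it. Periodic checkpointing keeps that dependency chain from growing without bound. The plain-Python model below only illustrates the incremental idea; it is not Spark code, and sliding_sums is a made-up name.

```python
from collections import deque


def sliding_sums(batches, window_len):
    """Yield the sum over the last `window_len` batches, updated incrementally."""
    window = deque()
    total = 0
    for batch in batches:
        window.append(batch)
        total += batch                 # reduce step: add the entering batch
        if len(window) > window_len:
            total -= window.popleft()  # inverse step: remove the leaving batch
        yield total


results = list(sliding_sums([1, 2, 3, 4, 5], window_len=3))
print(results)  # [1, 3, 6, 9, 12]
```

Each yielded total reuses the previous one instead of re-summing the whole window, which is the same trade that reduceByKeyAndWindow makes when an inverse function is supplied.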

morfious902002 answered Feb 11 '23 03:02
