While using apache-spark, I was trying to apply the "reduceByKeyAndWindow()" transformation on some streaming data, and got the following error:
pyspark.sql.utils.IllegalArgumentException: requirement failed: The checkpoint directory has not been set. Please set it by StreamingContext.checkpoint().
Is it necessary to set a checkpoint directory?
If yes, what is the easiest way to set one up?
Yes, it is necessary. Checkpointing must be enabled for applications with any of the following requirements:
Usage of stateful transformations - If either updateStateByKey or reduceByKeyAndWindow (with an inverse function) is used in the application, then the checkpoint directory must be provided to allow for periodic RDD checkpointing.
Recovering from failures of the driver running the application - Metadata checkpoints are used to recover with progress information. You can set up the checkpoint directory with ssc.checkpoint(checkpointDirectoryLocation), where ssc is your StreamingContext (note: it is set on the StreamingContext, not the SparkContext).
http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing