
Checkpoint RDD ReliableCheckpointRDD has different number of partitions from original RDD

I have a Spark cluster of two machines, and when I run a Spark Streaming application I get the following error:

Exception in thread "main" org.apache.spark.SparkException: Checkpoint RDD ReliableCheckpointRDD[11] at print at StatefulNetworkWordCount.scala:78(1) has different number of partitions from original RDD MapPartitionsRDD[10] at updateStateByKey at StatefulNetworkWordCount.scala:76(2)
    at org.apache.spark.rdd.ReliableRDDCheckpointData.doCheckpoint(ReliableRDDCheckpointData.scala:73)
    at org.apache.spark.rdd.RDDCheckpointData.checkpoint(RDDCheckpointData.scala:74)

How can I use a checkpoint directory on a file system that is not HDFS, Cassandra, or any other data store?

I have thought of two possible solutions, but I do not know how to implement them:

  1. Have one remote directory that both workers can access as if it were local.

  2. Specify a remote directory to both workers.

Any suggestions?

Soumitra asked Sep 27 '22 10:09


1 Answer

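OK, so I was able to go ahead with the first option.

I mounted a remote directory on all the workers, used it as the checkpoint directory, and it worked. The checkpoint directory has to resolve to the same shared storage on every machine; a directory that is local to each machine is presumably what caused the partition mismatch.

How to mount the remote checkpoint directory on each worker:

Install sshfs:

    sudo apt-get install sshfs

Load the FUSE module into the kernel:

    sudo modprobe fuse

Add your user to the fuse group (replace username with your own login):

    sudo adduser username fuse

Create the mount point:

    mkdir ~/checkpoint

Mount the remote directory, replacing <user>@<remote-host> with the SSH login and address of the machine that holds the directory:

    sshfs <user>@<remote-host>:/home/ubuntu/checkpoint ~/checkpoint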
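Before starting the streaming job, it may be worth confirming on each worker that the mounted directory exists and accepts writes. The following is only a sketch: check_checkpoint_dir is a hypothetical helper name, not part of Spark or sshfs, and it assumes a POSIX shell. If the mountpoint utility is installed, it also warns when nothing is mounted at the path.

```shell
#!/bin/sh
# Hypothetical helper (not part of Spark or sshfs): verify that a
# checkpoint directory exists and is writable on this machine.
# Returns 0 on success, 1 otherwise.
check_checkpoint_dir() {
    dir="$1"

    if [ ! -d "$dir" ]; then
        echo "missing: $dir is not a directory" >&2
        return 1
    fi

    # Warn, but do not fail, if nothing is mounted at the path.
    # The check is skipped when the mountpoint utility is unavailable.
    if command -v mountpoint >/dev/null 2>&1 && ! mountpoint -q "$dir"; then
        echo "warning: $dir is not a mount point; sshfs may not be active" >&2
    fi

    # Create and remove a probe file named after this host and process.
    probe="$dir/.probe-$(uname -n)-$$"
    if ! touch "$probe" 2>/dev/null; then
        echo "not writable: $dir" >&2
        return 1
    fi
    rm -f "$probe"
    echo "ok: $dir"
}
```

After sourcing the script on a worker, check_checkpoint_dir ~/checkpoint should print an "ok" line when the mount is usable there. A warning or a non-zero exit status suggests the sshfs mount is not active on that machine.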
Soumitra answered Sep 29 '22 07:09