I am trying to checkpoint an RDD to a non-HDFS system. From the DSE documentation it seems that it is not possible to use the Cassandra File System, so I am planning to use Amazon S3. However, I have not been able to find a good example of using AWS S3 as the checkpoint directory.
Questions
With Amazon EMR release version 5.17.0 and later, you can use S3 Select with Spark on Amazon EMR. S3 Select allows applications to retrieve only a subset of data from an object.
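As a rough illustration of S3 Select from Spark: the sketch below assumes an EMR 5.17.0+ cluster where the EMR-provided `s3selectCSV` data source is available; the format name, options, bucket, and path here are assumptions based on my reading of the EMR documentation and should be verified against it.

```scala
// Sketch only: requires an EMR 5.17.0+ cluster with the EMR S3 Select
// connector on the classpath. Bucket and object path are hypothetical.
val df = spark.read
  .format("s3selectCSV")                      // EMR-specific data source (assumption)
  .option("header", "true")
  .load("s3://my-bucket/data/events.csv")     // hypothetical location

// The filter is pushed down to S3 Select, so only the matching subset
// of the object is transferred from S3 to the executors.
df.filter("status = 'ERROR'").show()
```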
A checkpoint helps build fault-tolerant and resilient Spark applications. Spark Structured Streaming maintains intermediate state on an HDFS-compatible file system so it can recover from failures. To specify the checkpoint in a streaming query, we use the checkpointLocation option.
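A minimal sketch of setting checkpointLocation on a Structured Streaming query; the S3 bucket and path are hypothetical, and an S3 checkpoint additionally requires the Hadoop S3 credentials configured as described later in this post:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()

// The built-in "rate" source emits a fixed number of rows per second,
// which is convenient for demonstrating a streaming query.
val stream = spark.readStream
  .format("rate")
  .load()

val query = stream.writeStream
  .format("console")
  // State and offsets are persisted here for failure recovery.
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/demo") // hypothetical bucket
  .start()
```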
Types of Checkpointing in Apache Spark

There are 2 types of Apache Spark checkpointing:

Reliable Checkpointing: checkpointing in which the actual RDD is saved to reliable distributed storage, i.e. HDFS. We need to call the SparkContext.setCheckpointDir(directory: String) method to set the checkpointing directory.

Local Checkpointing: checkpointing in which the RDD is persisted to local storage on the executors (via RDD.localCheckpoint()), trading fault tolerance for performance.
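A short sketch of reliable RDD checkpointing: set the checkpoint directory first, then mark the RDD; the checkpoint is actually written when an action materializes the RDD. The S3 path is hypothetical, and with S3 you also need the credentials configured as shown in the solutions below.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-checkpoint"))
sc.setCheckpointDir("s3a://my-bucket/rdd-checkpoints") // hypothetical bucket

val rdd = sc.parallelize(1 to 1000).map(_ * 2)
rdd.checkpoint() // marks the RDD for checkpointing; nothing is written yet
rdd.count()      // the action triggers the actual checkpoint write
```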
Spark Streaming receives real-time data and divides it into smaller batches for the execution engine. In contrast, Structured Streaming is built on the Spark SQL API for data stream processing. There, queries are optimized by the Spark Catalyst optimizer and translated into RDDs for execution under the hood.
From the answer in the link
Solution 1:
export AWS_ACCESS_KEY_ID=<your access>
export AWS_SECRET_ACCESS_KEY=<your secret>
ssc.checkpoint(checkpointDirectory)
Set the checkpoint directory as S3 URL -
s3n://spark-streaming/checkpoint
Then launch your Spark application using spark-submit.
This works in Spark 1.4.2.
Solution 2:
val hadoopConf: Configuration = new Configuration()
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId", "id-1")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "secret-key")
StreamingContext.getOrCreate(checkPointDir, () => {
  createStreamingContext(checkPointDir, config)
}, hadoopConf)
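The createStreamingContext function referenced above is not shown in the answer. A minimal sketch of what such a factory might look like (the batch interval and the placement of the checkpoint call are assumptions):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical factory used by StreamingContext.getOrCreate: it builds a
// fresh context when no checkpoint exists, wires up the streams, and
// registers the checkpoint directory.
def createStreamingContext(checkPointDir: String, config: SparkConf): StreamingContext = {
  val ssc = new StreamingContext(config, Seconds(10)) // batch interval is an assumption
  // ... define input DStreams and transformations here ...
  ssc.checkpoint(checkPointDir) // metadata and state are written to this directory
  ssc
}
```

StreamingContext.getOrCreate only invokes the factory when no valid checkpoint is found at checkPointDir; otherwise it restores the context from the checkpoint data.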
To checkpoint to S3, you can pass the following notation to the StreamingContext method checkpoint(directory: String): Unit
s3n://<aws-access-key>:<aws-secret-key>@<s3-bucket>/<prefix ...>
Another reliable file system, not listed in the Spark documentation for checkpointing, is Tachyon (since renamed Alluxio).