Spark Streaming checkpoint to amazon s3

I am trying to checkpoint the RDD to a non-HDFS system. From the DSE documentation it seems that it is not possible to use the Cassandra file system, so I am planning to use Amazon S3. But I cannot find any good example of using AWS for this.

Questions

  • How do I use Amazon S3 as the checkpoint directory? Is it enough to call ssc.checkpoint(amazonS3Url)?
  • Is it possible to use any reliable data storage other than the Hadoop file system for checkpoints?
asked Nov 02 '15 by Knight71

People also ask

Can spark connect to S3?

With Amazon EMR release version 5.17.0 and later, you can use S3 Select with Spark on Amazon EMR. S3 Select allows applications to retrieve only a subset of data from an object.

Why do we use checkpoint in spark Streaming?

A checkpoint helps build fault-tolerant and resilient Spark applications. Spark Structured Streaming maintains an intermediate state on HDFS compatible file systems to recover from failures. To specify the checkpoint in a streaming query, we use the checkpointLocation parameter.
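As described above, Structured Streaming takes the checkpoint location as a per-query option. A minimal sketch of what that looks like (the bucket name and paths are placeholders, and an S3-capable filesystem such as s3a is assumed to be on the classpath):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()

// A rate source is used here only as a stand-in input stream.
val stream = spark.readStream.format("rate").load()

// checkpointLocation points the query's offset/state tracking at S3.
val query = stream.writeStream
  .format("console")
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/demo") // placeholder bucket
  .start()
```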

Does Apache spark provide checkpoints?

There are two types of Apache Spark checkpointing. Reliable checkpointing is checkpointing in which the actual RDD is saved to reliable distributed storage, i.e. HDFS. You need to call the SparkContext.setCheckpointDir(directory: String) method to set the checkpointing directory.
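The reliable RDD checkpointing just described looks roughly like this sketch (the directory is a placeholder; any HDFS-compatible URI works):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-checkpoint"))

// Point checkpoint storage at reliable distributed storage (HDFS here).
sc.setCheckpointDir("hdfs:///tmp/checkpoints") // placeholder path

val rdd = sc.parallelize(1 to 100).map(_ * 2)
rdd.checkpoint() // marks the RDD for checkpointing
rdd.count()      // an action triggers the actual save
```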

What is the difference between spark Streaming and structured Streaming?

Spark receives real-time data and divides it into smaller batches for the execution engine. In contrast, Structured Streaming is built on the SparkSQL API for data stream processing. In the end, all the APIs are optimized using Spark catalyst optimizer and translated into RDDs for execution under the hood.


2 Answers

From the answer in the link

Solution 1:

export AWS_ACCESS_KEY_ID=<your access>
export AWS_SECRET_ACCESS_KEY=<your secret>
ssc.checkpoint(checkpointDirectory)

Set the checkpoint directory to an S3 URL such as s3n://spark-streaming/checkpoint.

Then launch your Spark application using spark-submit. This works in Spark 1.4.2.
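Putting the steps above together, a launch might look like the following sketch (the class and jar names are placeholders for your own application):

```shell
# Credentials are picked up from the environment by the S3 filesystem.
export AWS_ACCESS_KEY_ID=<your access>
export AWS_SECRET_ACCESS_KEY=<your secret>

# Placeholder class/jar names; the checkpoint directory is the URL
# passed to ssc.checkpoint(...) inside the application.
spark-submit \
  --class com.example.StreamingApp \
  streaming-app.jar s3n://spark-streaming/checkpoint
```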

Solution 2:

  import org.apache.hadoop.conf.Configuration
  import org.apache.spark.streaming.StreamingContext

  val hadoopConf: Configuration = new Configuration()
  // Use the native S3 filesystem and supply credentials via Hadoop config.
  hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
  hadoopConf.set("fs.s3n.awsAccessKeyId", "id-1")
  hadoopConf.set("fs.s3n.awsSecretAccessKey", "secret-key")

  // getOrCreate restores the context from the checkpoint if one exists,
  // otherwise builds a fresh one via the factory.
  StreamingContext.getOrCreate(checkPointDir, () => {
    createStreamingContext(checkPointDir, config)
  }, hadoopConf)
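The createStreamingContext factory above is the caller's own code; a hedged sketch of what it might contain (the function body and batch interval are assumptions, not part of the original answer):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical factory: builds a fresh context when no checkpoint exists;
// StreamingContext.getOrCreate restores from checkPointDir otherwise.
def createStreamingContext(checkPointDir: String, config: SparkConf): StreamingContext = {
  val ssc = new StreamingContext(config, Seconds(10)) // assumed batch interval
  ssc.checkpoint(checkPointDir) // enables metadata and state checkpointing
  // ... define your input streams and transformations here ...
  ssc
}
```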
answered Oct 20 '22 by Knight71


To checkpoint to S3, you can pass the following notation to the StreamingContext method def checkpoint(directory: String): Unit:

s3n://<aws-access-key>:<aws-secret-key>@<s3-bucket>/<prefix ...>
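One caveat with embedding credentials in the URI as above: AWS secret keys may contain "/" or "+", which must be URL-encoded or the URI will not parse. A small helper (the function name is illustrative, not from any library) can handle this:

```scala
import java.net.URLEncoder

// Illustrative helper: builds the s3n checkpoint URI with the secret key
// URL-encoded, since raw "/" or "+" characters break URI parsing.
def s3nCheckpointUri(accessKey: String, secretKey: String,
                     bucket: String, prefix: String): String = {
  val encodedSecret = URLEncoder.encode(secretKey, "UTF-8")
  s"s3n://$accessKey:$encodedSecret@$bucket/$prefix"
}
```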

Another reliable file system not listed in the Spark documentation for checkpointing is Tachyon.

answered Oct 20 '22 by Jeremy Sanecki