I am trying to checkpoint an RDD to a non-HDFS system. From the DSE documentation it seems that it is not possible to use the Cassandra File System, so I am planning to use Amazon S3. However, I have not been able to find a good example of using AWS S3 as the checkpoint directory.
Questions
With Amazon EMR release version 5.17.0 and later, you can use S3 Select with Spark on Amazon EMR. S3 Select allows applications to retrieve only a subset of data from an object.
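As a rough illustration of S3 Select from Spark: the sketch below assumes an EMR 5.17.0+ cluster where the EMR-provided `s3selectCSV` data source is available; the format name, options, bucket, and path here are assumptions based on my reading of the EMR documentation and should be verified against it.

```scala
// Sketch only: requires an EMR 5.17.0+ cluster with the EMR S3 Select
// connector on the classpath. Bucket and object path are hypothetical.
val df = spark.read
  .format("s3selectCSV")                      // EMR-specific data source (assumption)
  .option("header", "true")
  .load("s3://my-bucket/data/events.csv")     // hypothetical location

// The filter is pushed down to S3 Select, so only the matching subset
// of the object is transferred from S3 to the executors.
df.filter("status = 'ERROR'").show()
```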
A checkpoint helps build fault-tolerant and resilient Spark applications. Spark Structured Streaming maintains intermediate state on an HDFS-compatible file system so it can recover from failures. To specify the checkpoint in a streaming query, we use the checkpointLocation option.
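A minimal sketch of setting checkpointLocation on a Structured Streaming query; the S3 bucket and path are hypothetical, and an S3 checkpoint additionally requires the Hadoop S3 credentials configured as described later in this post:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()

// The built-in "rate" source emits a fixed number of rows per second,
// which is convenient for demonstrating a streaming query.
val stream = spark.readStream
  .format("rate")
  .load()

val query = stream.writeStream
  .format("console")
  // State and offsets are persisted here for failure recovery.
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/demo") // hypothetical bucket
  .start()
```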
Types of Checkpointing in Apache Spark

There are 2 types of Apache Spark checkpointing:

Reliable Checkpointing: checkpointing in which the actual RDD is saved to reliable distributed storage, i.e. HDFS. We need to call the SparkContext.setCheckpointDir(directory: String) method to set the checkpointing directory.

Local Checkpointing: checkpointing in which the RDD is persisted to local storage on the executors (via RDD.localCheckpoint()), trading fault tolerance for performance.
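A short sketch of reliable RDD checkpointing: set the checkpoint directory first, then mark the RDD; the checkpoint is actually written when an action materializes the RDD. The S3 path is hypothetical, and with S3 you also need the credentials configured as shown in the solutions below.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-checkpoint"))
sc.setCheckpointDir("s3a://my-bucket/rdd-checkpoints") // hypothetical bucket

val rdd = sc.parallelize(1 to 1000).map(_ * 2)
rdd.checkpoint() // marks the RDD for checkpointing; nothing is written yet
rdd.count()      // the action triggers the actual checkpoint write
```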
Spark Streaming receives real-time data and divides it into smaller batches for the execution engine. In contrast, Structured Streaming is built on the Spark SQL API for data stream processing. There, queries are optimized by the Spark Catalyst optimizer and translated into RDDs for execution under the hood.
From the answer in the link
Solution 1:
export AWS_ACCESS_KEY_ID=<your access>
export AWS_SECRET_ACCESS_KEY=<your secret>
ssc.checkpoint(checkpointDirectory)
Set the checkpoint directory as S3 URL -
s3n://spark-streaming/checkpoint
Then launch your Spark application using spark-submit.
This works in Spark 1.4.2.
Solution 2:
val hadoopConf: Configuration = new Configuration()
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId", "id-1")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "secret-key")
StreamingContext.getOrCreate(checkPointDir, () => {
  createStreamingContext(checkPointDir, config)
}, hadoopConf)
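The createStreamingContext function referenced above is not shown in the answer. A minimal sketch of what such a factory might look like (the batch interval and the placement of the checkpoint call are assumptions):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical factory used by StreamingContext.getOrCreate: it builds a
// fresh context when no checkpoint exists, wires up the streams, and
// registers the checkpoint directory.
def createStreamingContext(checkPointDir: String, config: SparkConf): StreamingContext = {
  val ssc = new StreamingContext(config, Seconds(10)) // batch interval is an assumption
  // ... define input DStreams and transformations here ...
  ssc.checkpoint(checkPointDir) // metadata and state are written to this directory
  ssc
}
```

StreamingContext.getOrCreate only invokes the factory when no valid checkpoint is found at checkPointDir; otherwise it restores the context from the checkpoint data.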
To checkpoint to S3, you can pass the following notation to the StreamingContext method checkpoint(directory: String): Unit
s3n://<aws-access-key>:<aws-secret-key>@<s3-bucket>/<prefix ...>
Another reliable file system, not listed in the Spark documentation for checkpointing, is Tachyon (since renamed Alluxio).