
Spark DStream periodically call saveAsObjectFile using transform does not work as expected

I read data from Kafka using the DirectKafkaStream API [1], do some transformations, update a count, and then write the data back to Kafka. This piece of code is actually in a test:

kafkaStream[Key, Value]("test")
      .map(record => (record.key(), 1))
      .updateStateByKey[Int](
        (numbers: Seq[Int], state: Option[Int]) =>
          state match {
            case Some(s) => Some(s + numbers.length)
            case _ => Some(numbers.length)
          }
      )
      .checkpoint(this)("count") {
        case (save: (Key, Int), current: (Key, Int)) =>
          (save._1, save._2 + current._2)
      }
      .map(_._2)
      .reduce(_ + _)
      .map(count => (new Key, new Result[Long](count.toLong)))
      .toKafka(Key.Serializer.getClass.getName, Result.longKafkaSerializer.getClass.getName)

The checkpoint operator is an enrichment of the DStream API that I've created, which saves one RDD of the given DStream, for one Time, into HDFS using saveAsObjectFile. In practice it saves the result of every 60th micro-batch (RDD) into HDFS.

Checkpoint does the following:

def checkpoint(processor: Streaming)(name: String)(
    mergeStates: (T, T) => T): DStream[T] = {
  val path = processor.configuration.get[String](
    "processing.spark.streaming.checkpoint-directory-prefix") + "/" +
    Reflection.canonical(processor.getClass) + "/" + name + "/"
  logInfo(s"Checkpoint base path is [$path].")

  processor.registerOperator(name)

  if (processor.fromCheckpoint && processor.restorationPoint.isDefined) {
    val restorePath = path + processor.restorationPoint.get.ID.stringify
    logInfo(s"Restoring from path [$restorePath].")
    checkpointData = context.objectFile[T](restorePath).cache()

    stream
      .transform((rdd: RDD[T], time: Time) => {
        val merged = rdd
          .union(checkpointData)
          .map[(Boolean, T)](record => (true, record))
          .reduceByKey(mergeStates)
          .map[T](_._2)

        processor.maybeCheckpoint(name, merged, time)

        merged
      })
  } else {
    stream
      .transform((rdd: RDD[T], time: Time) => {
        processor.maybeCheckpoint(name, rdd, time)

        rdd
      })
  }
}

The effective piece of code is the following:

dstream.transform((rdd: RDD[T], time: Time) => {
  processor.maybeCheckpoint(name, rdd, time)

  rdd
})

The dstream variable in the code above is the result of the previous operator, which is updateStateByKey, so transform is called right after updateStateByKey. maybeCheckpoint does the following:

def maybeCheckpoint(name: String, rdd: RDD[_], time: Time) = {
  if (doCheckpoint(time)) {
    logInfo(s"Checkpointing for operator [$name] with RDD ID of [${rdd.id}].")
    val newPath = configuration.get[String](
      "processing.spark.streaming.checkpoint-directory-prefix") + "/" +
      Reflection.canonical(this.getClass) + "/" + name + "/" + checkpointBarcode
    logInfo(s"Saving new checkpoint to [$newPath].")
    rdd.saveAsObjectFile(newPath)
    registerCheckpoint(name, Operator(name), time)
    logInfo(s"Checkpoint completed for operator [$name].")
  }
}

As you can see, most of the code is just bookkeeping, but saveAsObjectFile is effectively called.

The problem is that even though the resulting RDDs from updateStateByKey should be persisted automatically, when saveAsObjectFile is called on every Xth micro-batch, Spark recomputes everything from scratch, from the beginning of the streaming job, starting by reading everything from Kafka again. I've tried to force cache or persist with different storage levels, on the DStreams as well as on the RDDs.
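
For illustration, forcing persistence at both levels looked roughly like the sketch below; the forceCaching helper and the checkpointRDD callback are hypothetical stand-ins, not the exact code of the job.

import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// Sketch of the kind of caching that was attempted (names are placeholders).
def forceCaching[T: ClassTag](stream: DStream[T],
                              checkpointRDD: (RDD[T], Time) => Unit): DStream[T] = {
  // Persist the DStream so every generated RDD is kept at the chosen level.
  stream.persist(StorageLevel.MEMORY_AND_DISK_SER)

  // Additionally cache the per-batch RDD before the side-effecting save.
  stream.transform((rdd: RDD[T], time: Time) => {
    rdd.cache()
    checkpointRDD(rdd, time)
    rdd
  })
}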

Micro-batches:

[screenshot: micro-batches]

DAG for job 22:

[screenshot: DAG for job 22]

DAG for the job that runs saveAsObjectFile:

[screenshots: SAOF1, SAOF2]

What could be the problem?

Thanks!

[1] Using Spark 2.1.0.

asked Mar 31 '17 19:03 by Dyin


1 Answer

I believe using transform to periodically checkpoint will cause unexpected cache behaviour.

Instead, using foreachRDD to perform the periodic checkpointing will allow the DAG to remain stable enough to cache RDDs effectively.

I'm almost positive that was the solution to a similar issue we had a while ago.
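
A minimal sketch of this approach, assuming the question's maybeCheckpoint is passed in as a callback and wrapping it in a hypothetical helper named checkpointViaForeachRDD:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// Sketch only: keep the transformation chain untouched and do the periodic
// saveAsObjectFile in an output operation instead of inside transform.
def checkpointViaForeachRDD[T](stream: DStream[T],
                               checkpointRDD: (RDD[T], Time) => Unit): DStream[T] = {
  // Cache so the periodic save reuses each batch's RDD instead of recomputing
  // the lineage all the way back to Kafka.
  stream.cache()

  // Output operation: runs the (possibly no-op) checkpoint for every batch,
  // e.g. delegating to maybeCheckpoint, without altering the DAG that
  // downstream operators depend on.
  stream.foreachRDD((rdd: RDD[T], time: Time) => checkpointRDD(rdd, time))

  stream
}

A hypothetical call site, mirroring the question's enrichment, would be checkpointViaForeachRDD(stream, processor.maybeCheckpoint(name, _, _)); the downstream operators then consume the returned stream unchanged, while the save runs as a separate output operation.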

answered by ImDarrenG