I've had a go at implementing a structured stream like so:
myDataSet
  .map(r => StatementWrapper.Transform(r))
  .writeStream
  .foreach(MyWrapper.myWriter)
  .start()
  .awaitTermination()
This all seems to work, but the throughput of MyWrapper.myWriter is horrible. It's effectively trying to be a JDBC sink; it looks like this:
import java.sql.{Connection, SQLException, SQLSyntaxErrorException}
import scala.util.Try
import org.apache.spark.sql.ForeachWriter

val myWriter: ForeachWriter[Seq[String]] = new ForeachWriter[Seq[String]] {
  var connection: Connection = _

  override def open(partitionId: Long, version: Long): Boolean = {
    Try(connection = getRemoteConnection).isSuccess
  }

  override def process(row: Seq[String]): Unit = {
    // A new Statement is created for every row, and each SQL string is parsed and executed individually.
    val statement = connection.createStatement()
    try {
      row.foreach(s => statement.execute(s))
    } catch {
      case e: SQLSyntaxErrorException => println(e)
      case e: SQLException => println(e)
    } finally {
      statement.closeOnCompletion()
    }
  }

  override def close(errorOrNull: Throwable): Unit = {
    connection.close()
  }
}
So my question is: is a new ForeachWriter instantiated for each row, so that open() and close() are called for every row in the dataset?
Is there a better design to improve throughput?
How can I parse the SQL statement once and execute it many times, while keeping the database connection open?
Spark Streaming receives real-time data and divides it into smaller batches for the execution engine. Structured Streaming, in contrast, is built on the Spark SQL API for data stream processing. In the end, these APIs are optimized by the Spark Catalyst optimizer and translated into RDDs for execution under the hood.
Spark SQL implements the higher-level Dataset and DataFrame APIs of Spark and adds SQL support on top of them. The libraries built on top of these are MLlib for machine learning, GraphFrames for graph analysis, and two APIs for stream processing: Spark Streaming and Structured Streaming.
Every data item arriving on the stream is like a new row being appended to the Input Table. A query on the input generates the "Result Table". Every trigger interval (say, every 1 second), new rows get appended to the Input Table, which eventually updates the Result Table.
Exactly-once semantics are only possible if the source is replayable and the sink is idempotent.
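As a rough, self-contained illustration of that model (this is not part of the original question), the sketch below uses the built-in rate source and the console sink with a 1-second trigger; it assumes Spark 2.2+ and all option values are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("trigger-example").master("local[*]").getOrCreate()

val counts = spark.readStream
  .format("rate")              // generates (timestamp, value) rows continuously
  .option("rowsPerSecond", 10)
  .load()
  .groupBy().count()           // the query that produces the Result Table

counts.writeStream
  .outputMode("complete")      // the whole Result Table is emitted each trigger
  .format("console")
  .trigger(Trigger.ProcessingTime("1 second"))
  .start()
  .awaitTermination()

Each trigger, the newly arrived rate rows are appended to the conceptual Input Table and the aggregation re-emits the updated Result Table to the console.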
Opening and closing of the underlying sink depend on your implementation of ForeachWriter.
The relevant class which invokes ForeachWriter is ForeachSink, and this is the code which calls your writer:
data.queryExecution.toRdd.foreachPartition { iter =>
  if (writer.open(TaskContext.getPartitionId(), batchId)) {
    try {
      while (iter.hasNext) {
        writer.process(encoder.fromRow(iter.next()))
      }
    } catch {
      case e: Throwable =>
        writer.close(e)
        throw e
    }
    writer.close(null)
  } else {
    writer.close(null)
  }
}
Opening and closing of the writer are attempted for each batch that is generated from your source, once per partition. If you want open and close to literally open and close the sink driver each time, you'll need to do so via your implementation.
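As one way to improve throughput, here is a hedged sketch (not the original code) of a ForeachWriter that keeps a single connection open for the whole partition and parses the SQL once as a PreparedStatement, batching the inserts. The table my_table, its (id, name) columns and the (Int, String) row type are placeholder assumptions; getRemoteConnection is the same helper used in the question:

import java.sql.{Connection, PreparedStatement}
import org.apache.spark.sql.ForeachWriter

val batchedWriter: ForeachWriter[(Int, String)] = new ForeachWriter[(Int, String)] {
  var connection: Connection = _
  var statement: PreparedStatement = _

  override def open(partitionId: Long, version: Long): Boolean = {
    // Called once per partition per micro-batch, so the connection and the
    // parsed statement are reused for every row in that partition.
    connection = getRemoteConnection
    connection.setAutoCommit(false)
    statement = connection.prepareStatement("INSERT INTO my_table (id, name) VALUES (?, ?)")
    true
  }

  override def process(row: (Int, String)): Unit = {
    // No re-parsing here: just bind parameters and queue the row.
    statement.setInt(1, row._1)
    statement.setString(2, row._2)
    statement.addBatch()
  }

  override def close(errorOrNull: Throwable): Unit = {
    try {
      if (errorOrNull == null) {
        statement.executeBatch()   // one round trip per partition instead of one per row
        connection.commit()
      }
    } finally {
      if (statement != null) statement.close()
      if (connection != null) connection.close()
    }
  }
}

Because open and close run once per partition per micro-batch (as the ForeachSink code above shows), the connection setup and the statement parse are amortised over all rows in the partition rather than paid per row.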
If you want more control over how the data is handled, you can implement the Sink trait, which gives you the batch id and the underlying DataFrame:
trait Sink {
  def addBatch(batchId: Long, data: DataFrame): Unit
}
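For instance, a minimal sketch of such a Sink could hand each micro-batch to Spark's built-in JDBC writer. This assumes the internal org.apache.spark.sql.execution.streaming.Sink trait from Spark 2.x; the URL, table name and properties are placeholders, and depending on the Spark version you may need to materialise the batch DataFrame before writing it:

import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.execution.streaming.Sink

// Hypothetical sink: url, table and props are placeholders, not values from the question.
class JdbcBatchSink(url: String, table: String, props: Properties) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // Each micro-batch arrives as a DataFrame; the built-in JDBC writer opens
    // connections per partition and batches the inserts itself.
    data.write.mode(SaveMode.Append).jdbc(url, table, props)
  }
}

Wiring this in requires a custom StreamSinkProvider registered as a data source format, which is more work than ForeachWriter but gives you the whole micro-batch at once so you can reuse the batch JDBC write path.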