 

spark-streaming and connection pool implementation

The Spark Streaming programming guide at https://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams shows the following code:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)  // return to the pool for future reuse
  }
}

I have tried to implement this using org.apache.commons.pool2 but running the application fails with the expected java.io.NotSerializableException:

15/05/26 08:06:21 ERROR OneForOneStrategy: org.apache.commons.pool2.impl.GenericObjectPool
java.io.NotSerializableException: org.apache.commons.pool2.impl.GenericObjectPool
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
 ...

I am wondering how realistic it is to implement a connection pool that is serializable. Has anyone succeeded in doing this?

Thank you.

asked May 26 '15 by botkop

People also ask

How is Streaming implemented with Spark?

Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.
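As a rough sketch of that batching model (not part of the original question), a minimal DStream application could look like the following; the application name, socket host/port, and 5-second batch interval are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal sketch: a StreamingContext with a 5-second batch interval.
// The socket host and port are placeholders for any live input source.
val conf = new SparkConf().setAppName("dstream-example").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))

val lines = ssc.socketTextStream("localhost", 9999)   // DStream[String]
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

ssc.start()            // start receiving data and processing micro-batches
ssc.awaitTermination()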

What is the difference between Spark and Spark Streaming?

Generally, Spark Streaming is used for real-time processing, but it is the older (or rather, the original) RDD-based API. Spark Structured Streaming is the newer, highly optimized API for Spark. Users are advised to use the newer Structured Streaming API.
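To make the distinction concrete, here is a small sketch of the same socket source expressed with the newer Structured Streaming API; the host and port are placeholders:

import org.apache.spark.sql.SparkSession

// Sketch: Structured Streaming reads the socket as an unbounded DataFrame
// instead of a DStream, and the engine optimizes the streaming query.
val spark = SparkSession.builder
  .appName("structured-example")
  .master("local[2]")
  .getOrCreate()

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val query = lines.writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()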

Which API is used by Spark Streaming?

Spark Streaming is an extension of the core Spark API that allows data engineers and data scientists to process real-time data from various sources including (but not limited to) Kafka, Flume, and Amazon Kinesis. This processed data can be pushed out to file systems, databases, and live dashboards.
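For example, the Kafka integration (spark-streaming-kafka-0-10) exposes a topic as a DStream. Below is a rough sketch; the broker address, group id, and topic name are placeholders:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

// Sketch: consume a Kafka topic as a DStream of consumer records.
val ssc = new StreamingContext(new SparkConf().setAppName("kafka-example"), Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "example-group",
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Array("example-topic"), kafkaParams)
)

stream.map(record => record.value).print()

ssc.start()
ssc.awaitTermination()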

Is Spark Streaming deprecated?

Now that the Direct API of Spark Streaming (we currently have version 2.3.2) is deprecated, and we recently added the Confluent platform (which comes with Kafka 2.2.0) to our project, we plan to migrate these applications.


1 Answer

The answer below is wrong! I'm leaving it here for reference, but it is wrong for the following reason: socketPool is declared as a lazy val, so it will get instantiated on the first request for access. Since the SocketPool case class is not Serializable, this means it will get instantiated within each partition. That makes the connection pool useless, because we want to keep connections across partitions and RDDs. It makes no difference whether this is implemented as a companion object or as a case class. Bottom line: the connection pool must be Serializable, and Apache Commons Pool is not.

import java.io.PrintStream
import java.net.Socket

import org.apache.commons.pool2.{PooledObject, BasePooledObjectFactory}
import org.apache.commons.pool2.impl.{DefaultPooledObject, GenericObjectPool}
import org.apache.spark.streaming.dstream.DStream

/**
 * Publish a Spark stream to a socket.
 */
class PooledSocketStreamPublisher[T](host: String, port: Int)
  extends Serializable {

    lazy val socketPool = SocketPool(host, port)

    /**
     * Publish the stream to a socket.
     */
    def publishStream(stream: DStream[T], callback: (T) => String) = {
        stream.foreachRDD { rdd =>

            rdd.foreachPartition { partition =>

                val socket = socketPool.getSocket
                val out = new PrintStream(socket.getOutputStream)

                partition.foreach { event =>
                    val text : String = callback(event)
                    out.println(text)
                    out.flush()
                }

                // don't close the stream here: closing the PrintStream would also
                // close the pooled socket, so just flush before returning it
                out.flush()
                socketPool.returnSocket(socket)

            }
        }
    }

}

/**
 * commons-pool2 factory that creates and wraps sockets for the given host and port.
 */
class SocketFactory(host: String, port: Int) extends BasePooledObjectFactory[Socket] {

    def create(): Socket = {
        new Socket(host, port)
    }

    def wrap(socket: Socket): PooledObject[Socket] = {
        new DefaultPooledObject[Socket](socket)
    }

}

/**
 * Thin wrapper around a GenericObjectPool of sockets.
 */
case class SocketPool(host: String, port: Int) {

    val socketPool = new GenericObjectPool[Socket](new SocketFactory(host, port))

    def getSocket: Socket = {
        socketPool.borrowObject
    }

    def returnSocket(socket: Socket) = {
        socketPool.returnObject(socket)
    }

}

which you can invoke as follows:

val socketStreamPublisher = new PooledSocketStreamPublisher[MyEvent](host = "10.10.30.101", port = 29009)
socketStreamPublisher.publishStream(myEventStream, (e: MyEvent) => Json.stringify(Json.toJson(e)))
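For the record, and not from the original answer: the usual way to get the behaviour the Spark docs describe ("a static, lazily initialized pool of connections") without making the pool Serializable is to hold it in a singleton object. The object is never shipped inside the closure; it is initialized lazily, once per executor JVM, the first time a task touches it, and is then shared by all partitions processed in that JVM. The sketch below reuses commons-pool2; the ExecutorSocketPool name and the hard-coded host/port are hypothetical placeholders:

import java.io.PrintStream
import java.net.Socket

import org.apache.commons.pool2.{BasePooledObjectFactory, PooledObject}
import org.apache.commons.pool2.impl.{DefaultPooledObject, GenericObjectPool}
import org.apache.spark.streaming.dstream.DStream

// Hypothetical sketch: the pool lives in a singleton object, so it is created
// lazily once per executor JVM and never needs to be serialized with the closure.
object ExecutorSocketPool {

    private val host = "10.10.30.101"
    private val port = 29009

    private class SocketFactory extends BasePooledObjectFactory[Socket] {
        override def create(): Socket = new Socket(host, port)
        override def wrap(socket: Socket): PooledObject[Socket] =
            new DefaultPooledObject[Socket](socket)
    }

    lazy val pool = new GenericObjectPool[Socket](new SocketFactory)
}

// Usage inside the stream: only the String payloads are captured by the closure,
// never the pool itself.
def publish(stream: DStream[String]): Unit =
    stream.foreachRDD { rdd =>
        rdd.foreachPartition { partition =>
            val socket = ExecutorSocketPool.pool.borrowObject()
            val out = new PrintStream(socket.getOutputStream)
            partition.foreach(line => out.println(line))
            out.flush()                                   // keep the socket open
            ExecutorSocketPool.pool.returnObject(socket)
        }
    }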
answered by botkop