 

spark-streaming and connection pool implementation

The Spark Streaming programming guide at https://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams shows the following code:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)  // return to the pool for future reuse
  }
}

I have tried to implement this using org.apache.commons.pool2 but running the application fails with the expected java.io.NotSerializableException:

15/05/26 08:06:21 ERROR OneForOneStrategy: org.apache.commons.pool2.impl.GenericObjectPool
java.io.NotSerializableException: org.apache.commons.pool2.impl.GenericObjectPool
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
 ...

I am wondering how realistic it is to implement a connection pool that is serializable. Has anyone succeeded in doing this?

Thank you.

asked May 26 '15 by botkop

People also ask

How is Streaming implemented with Spark?

Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.
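As a rough sketch of that batching model (not part of the original question), a minimal DStream application could look like the following; the application name, socket host/port, and 5-second batch interval are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal sketch: a StreamingContext with a 5-second batch interval.
// The socket host and port are placeholders for any live input source.
val conf = new SparkConf().setAppName("dstream-example").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))

val lines = ssc.socketTextStream("localhost", 9999)   // DStream[String]
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

ssc.start()            // start receiving data and processing micro-batches
ssc.awaitTermination()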

What is the difference between Spark and Spark Streaming?

Generally, Spark Streaming is used for real-time processing, but it is the older (or rather, the original) RDD-based API. Spark Structured Streaming is the newer, highly optimized API for Spark. Users are advised to use the newer Structured Streaming API.
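To make the distinction concrete, here is a small sketch of the same socket source expressed with the newer Structured Streaming API; the host and port are placeholders:

import org.apache.spark.sql.SparkSession

// Sketch: Structured Streaming reads the socket as an unbounded DataFrame
// instead of a DStream, and the engine optimizes the streaming query.
val spark = SparkSession.builder
  .appName("structured-example")
  .master("local[2]")
  .getOrCreate()

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val query = lines.writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()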

Which API is used by Spark Streaming?

Spark Streaming is an extension of the core Spark API that allows data engineers and data scientists to process real-time data from various sources including (but not limited to) Kafka, Flume, and Amazon Kinesis. This processed data can be pushed out to file systems, databases, and live dashboards.
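For example, the Kafka integration (spark-streaming-kafka-0-10) exposes a topic as a DStream. Below is a rough sketch; the broker address, group id, and topic name are placeholders:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

// Sketch: consume a Kafka topic as a DStream of consumer records.
val ssc = new StreamingContext(new SparkConf().setAppName("kafka-example"), Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "example-group",
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Array("example-topic"), kafkaParams)
)

stream.map(record => record.value).print()

ssc.start()
ssc.awaitTermination()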

Is Spark Streaming deprecated?

Now that the Direct API of Spark Streaming (we currently have version 2.3.2) is deprecated, and we recently added the Confluent platform (which comes with Kafka 2.2.0) to our project, we plan to migrate these applications.


1 Answer

The answer below is wrong! I'm leaving it here for reference, but it is wrong for the following reason: socketPool is declared as a lazy val, so it will get instantiated on the first request for access. Since the SocketPool case class is not Serializable, this means it will get instantiated within each partition. That makes the connection pool useless, because we want to keep connections across partitions and RDDs. It makes no difference whether this is implemented as a companion object or as a case class. Bottom line: the connection pool must be Serializable, and Apache Commons Pool is not.

import java.io.PrintStream
import java.net.Socket

import org.apache.commons.pool2.{PooledObject, BasePooledObjectFactory}
import org.apache.commons.pool2.impl.{DefaultPooledObject, GenericObjectPool}
import org.apache.spark.streaming.dstream.DStream

/**
 * Publish a Spark stream to a socket.
 */
class PooledSocketStreamPublisher[T](host: String, port: Int)
  extends Serializable {

    lazy val socketPool = SocketPool(host, port)

    /**
     * Publish the stream to a socket.
     */
    def publishStream(stream: DStream[T], callback: (T) => String) = {
        stream.foreachRDD { rdd =>

            rdd.foreachPartition { partition =>

                val socket = socketPool.getSocket
                val out = new PrintStream(socket.getOutputStream)

                partition.foreach { event =>
                    val text : String = callback(event)
                    out.println(text)
                    out.flush()
                }

                // don't close the stream here: closing the PrintStream would also
                // close the pooled socket, so just flush before returning it
                out.flush()
                socketPool.returnSocket(socket)

            }
        }
    }

}

/**
 * commons-pool2 factory that creates and wraps sockets for the given host and port.
 */
class SocketFactory(host: String, port: Int) extends BasePooledObjectFactory[Socket] {

    def create(): Socket = {
        new Socket(host, port)
    }

    def wrap(socket: Socket): PooledObject[Socket] = {
        new DefaultPooledObject[Socket](socket)
    }

}

/**
 * Thin wrapper around a GenericObjectPool of sockets.
 */
case class SocketPool(host: String, port: Int) {

    val socketPool = new GenericObjectPool[Socket](new SocketFactory(host, port))

    def getSocket: Socket = {
        socketPool.borrowObject
    }

    def returnSocket(socket: Socket) = {
        socketPool.returnObject(socket)
    }

}

which you can invoke as follows:

val socketStreamPublisher = new PooledSocketStreamPublisher[MyEvent](host = "10.10.30.101", port = 29009)
socketStreamPublisher.publishStream(myEventStream, (e: MyEvent) => Json.stringify(Json.toJson(e)))
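For the record, and not from the original answer: the usual way to get the behaviour the Spark docs describe ("a static, lazily initialized pool of connections") without making the pool Serializable is to hold it in a singleton object. The object is never shipped inside the closure; it is initialized lazily, once per executor JVM, the first time a task touches it, and is then shared by all partitions processed in that JVM. The sketch below reuses commons-pool2; the ExecutorSocketPool name and the hard-coded host/port are hypothetical placeholders:

import java.io.PrintStream
import java.net.Socket

import org.apache.commons.pool2.{BasePooledObjectFactory, PooledObject}
import org.apache.commons.pool2.impl.{DefaultPooledObject, GenericObjectPool}
import org.apache.spark.streaming.dstream.DStream

// Hypothetical sketch: the pool lives in a singleton object, so it is created
// lazily once per executor JVM and never needs to be serialized with the closure.
object ExecutorSocketPool {

    private val host = "10.10.30.101"
    private val port = 29009

    private class SocketFactory extends BasePooledObjectFactory[Socket] {
        override def create(): Socket = new Socket(host, port)
        override def wrap(socket: Socket): PooledObject[Socket] =
            new DefaultPooledObject[Socket](socket)
    }

    lazy val pool = new GenericObjectPool[Socket](new SocketFactory)
}

// Usage inside the stream: only the String payloads are captured by the closure,
// never the pool itself.
def publish(stream: DStream[String]): Unit =
    stream.foreachRDD { rdd =>
        rdd.foreachPartition { partition =>
            val socket = ExecutorSocketPool.pool.borrowObject()
            val out = new PrintStream(socket.getOutputStream)
            partition.foreach(line => out.println(line))
            out.flush()                                   // keep the socket open
            ExecutorSocketPool.pool.returnObject(socket)
        }
    }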
answered by botkop