Spark: How to use mapPartitions and create/close a connection per partition

Tags: scala, apache-spark, rdd

I want to perform certain operations on my Spark DataFrame, write the rows to a DB, and end up with another DataFrame. It looks like this:

import sqlContext.implicits._

val newDF = myDF.mapPartitions(
  iterator => {
    val conn = new DbConnection
    iterator.map(
       row => {
         addRowToBatch(row)
         convertRowToObject(row)
     })
    conn.writeTheBatchToDB()
    conn.close()
  })
  .toDF()

This gives me an error: mapPartitions expects a return type of Iterator[NotInferedR], but here it is Unit. I know this is possible with foreachPartition, but I'd like to do the mapping as well. Doing the two separately would add overhead (an extra Spark job). What should I do?

Thanks!


asked Apr 11 '16 10:04

void

2 Answers

In most cases, eagerly consuming the iterator will result in execution failures, or at least slow the job down. So what I've done instead is check whether the iterator is already empty and, if it is, run the cleanup routines.

rdd.mapPartitions(itr => {
    val conn = new DbConnection
    itr.map(data => {
       val yourActualResult = // do something with your data and conn here
       if (itr.isEmpty) conn.close() // last element of the partition: close the connection
       yourActualResult
    })
})

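I thought this was a Spark problem at first, but it was actually a Scala one: Iterator.map is lazy, so the function passed to it only runs as the iterator is consumed downstream, and the connection has to stay open until the last element. http://www.scala-lang.org/api/2.12.0/scala/collection/Iterator.html#isEmpty:Boolean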
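To make the behaviour easier to try outside a cluster, here is a minimal sketch of the same idea on a plain Scala Iterator. FakeConnection, LazyClose and processPartition are hypothetical names invented for this illustration (they are not Spark or JDBC APIs), and the rows are plain Ints. The sketch also covers one case the snippet above would miss: for an empty partition the map body never runs, so the connection is closed up front.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a real DB connection; it only records what happened.
final class FakeConnection {
  var closed = false
  val written = ArrayBuffer.empty[Int]
  def write(x: Int): Unit = {
    require(!closed, "connection already closed")
    written += x
  }
  def close(): Unit = { closed = true }
}

object LazyClose {
  // Same shape as the function passed to mapPartitions: Iterator in, Iterator out.
  def processPartition(itr: Iterator[Int], conn: FakeConnection): Iterator[String] =
    if (itr.isEmpty) {
      // Empty partition: the map body below would never run, so close right away.
      conn.close()
      Iterator.empty
    } else {
      itr.map { x =>
        conn.write(x)                 // use the connection while it is still open
        val result = s"row-$x"
        if (itr.isEmpty) conn.close() // nothing left after this element: clean up
        result
      }
    }
}
```

Note that the connection is only closed once the consumer has pulled the last element. If something downstream stops early (a take, for instance), this approach would leave the connection open, so treat it as a sketch rather than a complete cleanup strategy.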

answered Sep 22 '22 07:09

dansuzuki

The last expression in the anonymous function implementation must be the return value:

import sqlContext.implicits._

val newDF = myDF.mapPartitions(
  iterator => {
    val conn = new DbConnection
    // using toList to force eager computation - make it happen now when connection is open
    val result = iterator.map(/* the same... */).toList
    conn.writeTheBatchToDB()
    conn.close()
    result.iterator
  }
).toDF()
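For reference, the eager variant can be sketched on a plain Scala Iterator as follows. BatchConnection, addToBatch and EagerPartition are hypothetical stand-ins for the DbConnection and addRowToBatch from the question, and the rows are plain Ints. The try/finally is my own addition so that the connection is closed even if processing a row throws.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for the DbConnection from the question.
final class BatchConnection {
  private val batch = ArrayBuffer.empty[Int]
  val flushed = ArrayBuffer.empty[Int]
  var closed = false
  def addToBatch(x: Int): Unit = {
    require(!closed, "connection already closed")
    batch += x
  }
  def writeTheBatchToDB(): Unit = {
    require(!closed, "connection already closed")
    flushed ++= batch
    batch.clear()
  }
  def close(): Unit = { closed = true }
}

object EagerPartition {
  // Same shape as the function passed to mapPartitions: Iterator in, Iterator out.
  def processPartition(rows: Iterator[Int], conn: BatchConnection): Iterator[String] =
    try {
      // toList forces the lazy map to run now, while the connection is open.
      val result = rows.map { x =>
        conn.addToBatch(x)
        s"row-$x"
      }.toList
      conn.writeTheBatchToDB()
      result.iterator // the last expression of the try block is the return value
    } finally {
      conn.close()    // runs on success and on failure
    }
}
```

The trade-off is that toList holds the whole partition's results in memory at once, so this is simplest when partitions are reasonably small.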

answered Sep 22 '22 07:09

Tzach Zohar

