Using Futures within Spark

Tags: scala, apache-spark

A Spark job makes a remote web service call for every element in an RDD. A simple implementation might look something like this:

def webServiceCall(url: String) = scala.io.Source.fromURL(url).mkString
val rdd2 = rdd1.map(x => webServiceCall(x.field1))

(The above example has been kept simple and does not handle timeouts).

There is no interdependency between any of the results for different elements of the RDD.

Would the above be improved by using Futures to optimise performance by making parallel calls to the web service for each element of the RDD? Or does Spark itself have that level of optimization built in, so that it will run the operations on each element in the RDD in parallel?

If the above can be optimized by using Futures, does anyone have some code examples showing the correct way to use Futures within a function passed to a Spark RDD?

Thanks


asked May 27 '16 08:05

user1052610

2 Answers

"Or does Spark itself have that level of optimization built in, so that it will run the operations on each element in the RDD in parallel?"

It doesn't. Spark parallelizes tasks at the partition level but by default every partition is processed sequentially in a single thread.

"Would the above be improved by using Futures"

It could be an improvement, but it is quite hard to get right. In particular:

- Every Future has to be completed in the same stage, before any reshuffle takes place.
- Given the lazy nature of the Iterators used to expose partition data, you cannot do it with high-level primitives like map (see for example Spark job with Async HTTP call).
- You can build custom logic using mapPartitions, but then you have to deal with all the consequences of non-lazy partition evaluation.
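
The second and third points can be illustrated without a cluster. The following is a minimal sketch, not code from this answer: PartitionFutures, naive, batched and batchSize are hypothetical names, and the functions work on a plain Iterator so that they have the shape mapPartitions expects. It uses only the Scala standard library.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object PartitionFutures {

  // Naive version. Iterator.map is lazy, so each Future is only created when the
  // second map pulls it, and it is awaited before the next one is created.
  // The calls therefore still run one at a time.
  def naive[A, B](call: A => B, timeout: FiniteDuration)(iter: Iterator[A])(
      implicit ec: ExecutionContext): Iterator[B] =
    iter.map(x => Future(call(x))).map(f => Await.result(f, timeout))

  // Batched version. grouped yields strict batches of at most batchSize elements,
  // so all Futures of a batch are started before any of them is awaited.
  // The timeout applies to each batch, not to the whole partition.
  def batched[A, B](call: A => B, batchSize: Int, timeout: FiniteDuration)(iter: Iterator[A])(
      implicit ec: ExecutionContext): Iterator[B] =
    iter.grouped(batchSize).flatMap { batch =>
      val inFlight = batch.map(x => Future(call(x))) // started eagerly
      Await.result(Future.sequence(inFlight), timeout) // blocks until the batch is done
    }
}
```

Under these assumptions, something like rdd1.map(_.field1).mapPartitions(it => PartitionFutures.batched(webServiceCall, 8, 30.seconds)(it)) would issue up to eight calls at a time per task, provided an implicit ExecutionContext is available on the executors. Each batch is held in memory until it completes, which is a small instance of the non-lazy evaluation trade-off from the last point.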


answered Sep 20 '22 01:09

zero323

I couldn't find an easy way to achieve this, but after several iterations of retries this is what I did, and it's working for a huge list of queries. Basically, we used this to run a batch operation that splits a huge query into multiple subqueries.

// Imports assumed by this snippet; the global ExecutionContext is an assumption,
// any implicit ExecutionContext in scope will do
import org.apache.spark.rdd.RDD
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Break the huge workload down into smaller chunks; in this case a huge query string
// is broken down into a small set of subqueries.
// To optimize further, you can pass an explicit number of partitions when parallelizing.
val queries = sqlContext.sparkContext.parallelize[String](subQueryList.toSeq)

// Then map each of those to a task, in this case a Future that returns a String
val tasks: RDD[Future[String]] = queries.map { query =>
    val task = makeHttpCall(query) // Method returns the HTTP call response as a Future[String]
    // Log failures as a side effect. The original Future is returned unchanged,
    // so a failure still surfaces in Await.result below.
    task.failed.foreach(t => logger.error("execution failed: " + t.getMessage))
    task
}

// Note: the HTTP call is still not invoked; map is lazy, so this only extends the lineage

// Then, in each partition, combine all the Futures (there can be several per partition) with
// Future.sequence and Await the result, which blocks until every Future in that sequence is resolved

val contentRdd = tasks.mapPartitions[String] { f: Iterator[Future[String]] =>
   val searchFuture: Future[Iterator[String]] = Future.sequence(f)
   Await.result(searchFuture, threadWaitTime.seconds)
}

// Note: at this point you can apply any transformations to this RDD and they are appended to the lineage.
// When you perform an action on that RDD, the mapPartitions step is evaluated: the Futures are created,
// the subqueries run as parallel HTTP requests, and the results are collected in a single RDD.

I'm reposting it from my original answer here
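
One caveat with Future.sequence is that the combined Future fails if any element fails, so a single bad HTTP call fails the whole partition once Await.result rethrows the exception. A possible variation, sketched here under the assumption that per-element failures should be kept as values rather than abort the task, wraps each result in a Try first. SafeSequence, lift and awaitAll are hypothetical names, not part of this answer or of Spark.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import scala.util.{Failure, Success, Try}
import scala.util.control.NonFatal

object SafeSequence {

  // Turn a Future[A] into a Future[Try[A]] that succeeds even when the
  // original Future failed with a non-fatal exception.
  def lift[A](f: Future[A])(implicit ec: ExecutionContext): Future[Try[A]] =
    f.map[Try[A]](Success(_)).recover { case NonFatal(e) => Failure(e) }

  // Await every Future of a partition and return one Try per element,
  // in the original order.
  def awaitAll[A](futures: Iterator[Future[A]], timeout: FiniteDuration)(
      implicit ec: ExecutionContext): Iterator[Try[A]] = {
    val started = futures.toVector // forces the iterator, so every call is in flight
    Await.result(Future.sequence(started.map(f => lift(f))), timeout).iterator
  }
}
```

With this shape, the mapPartitions step would produce an RDD of Try[String], and failed requests could be filtered out or retried in a later transformation. A timeout of the Await itself still throws, since it applies to the whole partition.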

answered Sep 20 '22 01:09

raksja

