 

How do you perform blocking IO in an Apache Spark job?

What if, when I traverse an RDD, I need to calculate values in the dataset by calling an external (blocking) service? How do you think that could be achieved?

val values: Future[RDD[Double]] = Future sequence tasks

I've tried to create a list of Futures, but as RDD is not Traversable, Future.sequence is not suitable.

I just wonder if anyone has had such a problem, and how you solved it? What I'm trying to achieve is parallelism on a single worker node, so that I can call that external service 3000 times per second.

Probably there is another solution more suitable for Spark, like having multiple worker nodes on a single host.

I'm interested to know how you cope with such a challenge. Thanks.

Dr.Khu asked Sep 08 '14

People also ask

Which is one of the possible ways to optimize a Spark job?

Ideally, Spark runs one thread per task and one task per CPU core. Each task is tied to a single partition. Thus, a first intuition is to configure at least as many partitions as there are available CPU cores, so that all cores are occupied most of the time during the execution of the Spark job.
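
For example, a minimal sketch of two ways to control the partition count (the path and multiplier are placeholders, and sc is an existing SparkContext):

// read with at least one partition per available core; the path is a placeholder
val cores = sc.defaultParallelism
val lines = sc.textFile("hdfs:///data/input.log", cores)

// or rebalance an existing RDD afterwards
val rebalanced = lines.repartition(cores * 2)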

How can we reduce garbage collection in Spark?

To avoid full GC in G1 GC, a commonly used approach is to decrease the InitiatingHeapOccupancyPercent option's value (the default value is 45), so that G1 GC starts its initial concurrent marking at an earlier time and a full GC is more likely to be avoided.
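
As an illustration, such a flag can be passed to the executor JVMs through spark.executor.extraJavaOptions (the value 35 below is only an example, not a recommendation):

// illustrative only: lower the G1 initiating heap occupancy for executors
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35")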

How will you do memory tuning in Spark?

In order to reduce memory usage, you might have to store Spark RDDs in serialized form. Data serialization also helps achieve good network performance. You can also obtain good Spark performance by terminating jobs that run too long.
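
For instance, a minimal sketch of switching to Kryo serialization and keeping an RDD in memory in serialized form (rdd here stands for any existing RDD):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

// switch from Java serialization to Kryo
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

// keep the (assumed pre-existing) RDD in memory in serialized form
val cached = rdd.persist(StorageLevel.MEMORY_ONLY_SER)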


2 Answers

Here is the answer to my own question:

import org.apache.spark.rdd.RDD
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val JOB_TIMEOUT = 10.minutes // placeholder timeout for the per-partition wait

// split the input into 100 partitions; each partition is the unit of work sent to a worker
val buckets = sc.textFile(logFile, 100)
val tasks: RDD[Future[Object]] = buckets map { item =>
  Future {
    // call native code
  }
}

// within each partition, wait for all outstanding futures to complete
val values = tasks.mapPartitions[Object] { f: Iterator[Future[Object]] =>
  val searchFuture: Future[Iterator[Object]] = Future sequence f
  Await.result(searchFuture, JOB_TIMEOUT)
}

The idea here is that we get a collection of partitions, where each partition is sent to a specific worker and is the smallest unit of work. Each unit of work contains the data, which can be processed by calling the native code and passing that data to it.

The 'values' collection contains the data returned from the native code, and that work is done across the cluster.

Dr.Khu answered Nov 15 '22


Based on your answer, that the blocking call is to compare the provided input with each individual item in the RDD, I would strongly consider rewriting the comparison in Java/Scala so that it can be run as part of your Spark process. If the comparison is a "pure" function (no side effects, depends only on its inputs), it should be straightforward to re-implement, and the decrease in complexity and increase in stability in your Spark process, due to not having to make remote calls, will probably make it worth it.

It seems unlikely that your remote service will be able to handle 3000 calls per second, so a local in-process version would be preferable.

If that is absolutely impossible for some reason, then you might be able to create an RDD transformation which turns your data into an RDD of futures, in pseudo-code:

def callRemote(data: Data): Future[Double] = ...

val inputData: RDD[Data] = ...

val transformed: RDD[Future[Double]] = inputData.map(callRemote)

And then carry on from there, computing on your Future[Double] objects.
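
For completeness, one way to then turn those futures back into plain values is to wait for them per partition, much like the accepted answer does (the 10.minutes timeout is just a placeholder):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// block once per partition rather than once per element
val results: RDD[Double] = transformed.mapPartitions { futures =>
  Await.result(Future.sequence(futures), 10.minutes)
}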

If you know how much parallelism your remote process can handle, it might be best to abandon the Future mode and accept that it is a bottleneck resource.

val remoteParallelism: Int = 100 // some constant

def callRemoteBlocking(data: Data): Double = ...

val inputData: RDD[Data] = ...

val transformed: RDD[Double] = inputData.
  coalesce(remoteParallelism).
  map(callRemoteBlocking)

Your job will probably take quite some time, but it shouldn't flood your remote service and die horribly.

A final option: if the inputs are reasonably predictable and the range of outcomes is consistent and limited to some reasonable number of outputs (millions or so), you could precompute them all as a data set using your remote service and look them up at Spark job time using a join.
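
A rough sketch of that approach, reusing inputData from the earlier snippets and assuming the precomputed (input, result) pairs were saved somewhere Spark can read them (the path is a placeholder):

// lookup table produced offline by calling the remote service once per distinct input
val precomputed: RDD[(Data, Double)] =
  sc.objectFile[(Data, Double)]("hdfs:///precomputed/results")

// join against the table instead of calling the service at job time
val withResults: RDD[(Data, Double)] =
  inputData.map(d => (d, ())).join(precomputed).mapValues { case (_, result) => result }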

DPM answered Nov 15 '22