 

Does it make sense to run Spark job for its side effects?

Tags:

apache-spark

I want to run a Spark job where each element of an RDD is responsible for sending certain traffic over a network connection. The return value from each element is not very important, but I could perhaps have each one return the number of messages sent. The important part is the network traffic, which is basically a side effect of running a function over each element.

Is it a good idea to perform the above task in Spark?

I'm trying to simulate network traffic from multiple sources to test the data collection infrastructure on the receiving end. I could instead manually set up multiple machines to run the sender, but I thought it'd be nice if I could take advantage of Spark's existing distributed framework.

However, it seems like Spark is designed for programs to "compute" and then "return" something, not for programs to run for their side effects. I'm not sure if this is a good idea, and would appreciate input from others.

To be clear, I'm thinking of something like the following:

from operator import add  # needed by reduceByKey below

IDs = sc.parallelize(range(0, n))

def f(x):
    # make_message and SEND_OVER_NETWORK are placeholders for my sender code
    for i in range(0, 100):
        message = make_message(x, i)
        SEND_OVER_NETWORK(message)
    return (x, 100)

IDsOne = IDs.map(f)
counts = IDsOne.reduceByKey(add)

for (ID, count) in counts.collect():
    print("%i ran %i times" % (ID, count))
asked Oct 19 '22 by user3240688

1 Answer

Generally speaking it doesn't make sense:

  1. Spark is a heavyweight framework. At its core there is a lot of machinery that makes sure data is properly distributed and collected, that recovery is possible, and so on. This has a significant impact on overall performance and latency, but provides no benefit for side-effects-only tasks.
  2. Spark concurrency is relatively coarse-grained, with the partition as the main unit of concurrency. At this level processing becomes synchronous: you cannot move on to the next partition before you finish the current one.

    Let's say in your case there is a single slow SEND_OVER_NETWORK call. If you use map, you pretty much block processing of a whole partition. You can drop to a lower level with mapPartitions, make SEND_OVER_NETWORK asynchronous, and return only when the whole partition has been processed (see the first sketch after this list). It is better, but still suboptimal.

    You can increase the number of partitions, but that means higher bookkeeping overhead, so at the end of the day you can make the situation worse, not better.

  3. The Spark API is designed mostly for side-effect-free operations, which makes it hard to express operations that don't fit this model.

    What is arguably more important is that Spark guarantees only that each operation is executed at-least-once (let's ignore the zero-times case when an RDD is never materialized). If the application requires, for example, exactly-once semantics, things become tricky, especially when you consider point 2.

    It is possible to keep track of local state for each partition outside the main Spark logic (see the second sketch below), but if you get there it is a really good sign that Spark is not the right tool.
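
Here is a minimal sketch of the mapPartitions idea from point 2. It assumes SEND_OVER_NETWORK and make_message are the placeholders from the question, and the thread-pool size and message count per ID are arbitrary illustrations rather than recommendations:

from concurrent.futures import ThreadPoolExecutor

def send_partition(ids):
    # Materialize the partition, then fan the sends out over a small
    # thread pool so one slow SEND_OVER_NETWORK call does not block
    # the rest of the partition. The pool size is a guess.
    ids = list(ids)
    sent = 0
    with ThreadPoolExecutor(max_workers=16) as pool:
        futures = [pool.submit(SEND_OVER_NETWORK, make_message(x, i))
                   for x in ids for i in range(100)]
        for future in futures:
            future.result()  # surface any send failure
            sent += 1
    yield sent               # one count per partition

counts = IDs.mapPartitions(send_partition).collect()
print("sent %i messages in total" % sum(counts))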

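And a hedged sketch of what "local state for each partition" in point 3 could look like: tag every message with the partition id and task attempt number so the receiving side can de-duplicate traffic that is re-sent when Spark retries a task. TaskContext is part of PySpark; the tagging scheme itself is only an assumption about what your receiver could support:

from pyspark import TaskContext

def send_partition_tagged(ids):
    tc = TaskContext.get()  # available inside executor tasks
    tag = (tc.partitionId(), tc.attemptNumber())
    sent = 0
    for x in ids:
        for i in range(100):
            # make_message and SEND_OVER_NETWORK are the question's
            # placeholders; attaching `tag` is an assumed extension of the
            # message format so the receiver can drop duplicates.
            SEND_OVER_NETWORK((tag, make_message(x, i)))
            sent += 1
    yield sent

counts = IDs.mapPartitions(send_partition_tagged).collect()

If you find yourself doing this kind of bookkeeping, that is exactly the sign that Spark is not the right tool for the job.
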
answered Oct 22 '22 by zero323