
Is it possible to get the first n elements of every RDD in Spark Streaming?

When using Spark Streaming, is it possible to get the first n elements of every RDD in a DStream? In the real world, my stream consists of a number of geotagged events, and I want to take the 100 (or whatever) which are closest to a given point for further processing, but a simple example which shows what I'm trying to do is something like:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.ConstantInputDStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

object take {
  def main(args: Array[String]) {

    val data = 1 to 10

    val sparkConf = new SparkConf().setAppName("Take")
    val streamingContext = new StreamingContext(sparkConf, Seconds(1))

    val rdd = streamingContext.sparkContext.makeRDD(data)
    val stream = new ConstantInputDStream(streamingContext, rdd)

    // In the real world, do a bunch of stuff which results in an ordered RDD

    // This obviously doesn't work
    // val filtered = stream.transform { _.take(5) }
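    // (RDD.take(n) returns an Array on the driver, not an RDD, so transform can't accept it)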

    // In the real world, do some more processing on the DStream

    stream.print()

    streamingContext.start()
    streamingContext.awaitTermination()
  }
}

I understand I could pull the top n results back to the driver fairly easily, but that isn't something I want to do in this case as I need to do further processing on the RDD after having filtered it down.

asked Jul 21 '15 09:07 by Philip Kendall


People also ask

How many RDDs can cogroup() work on at once?

cogroup() can be used for much more than just implementing joins. We can also use it to implement intersect by key. Additionally, cogroup() can work on three or more RDDs at once.
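A minimal sketch of that, assuming an existing SparkContext named sc; the RDDs a, b and c are made up for illustration:

// Assumed: sc is an existing SparkContext; a, b, c are illustrative pair RDDs.
val a = sc.parallelize(Seq(1 -> "a1", 2 -> "a2"))
val b = sc.parallelize(Seq(1 -> "b1", 3 -> "b3"))
val c = sc.parallelize(Seq(1 -> "c1", 2 -> "c2"))

// cogroup over three RDDs at once: for each key, the grouped values from a, b and c
val grouped = a.cogroup(b, c)

// "Intersect by key": keep only the keys that appear in all three RDDs
val commonKeys = grouped.filter { case (_, (as, bs, cs)) =>
  as.nonEmpty && bs.nonEmpty && cs.nonEmpty
}.keys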

Does Spark streaming use RDD?

DStreams are built on RDDs, Spark's core data abstraction. This allows Spark Streaming to seamlessly integrate with any other Spark components like MLlib and Spark SQL.

Which of the following are possible data sources in Spark streaming?

Different data sources that Spark supports are Parquet, CSV, Text, JDBC, AVRO, ORC, HIVE, Kafka, Azure Cosmos, Amazon S3, Redshift, etc. Parquet…


1 Answer

Why is it not working? I think your example is fine.

  1. Compute the distance to the given point for each event
  2. Sort the events by distance, with a number of partitions adapted to your amount of data
  3. Take the first 100 events from each partition (so you only shuffle a small part of the initial data), then turn the returned collection back into an RDD with sparkContext.parallelize(data)
  4. Sort again with a single partition so all the remaining data ends up in the same dataset
  5. Take the first 100 events; this is your top 100

The code for the sort is the same in steps 2 and 4; you just change the number of partitions.

Step 1 is executed on the DStream; steps 2 to 5 are executed on the RDDs inside a transform operation, as sketched below.
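
A minimal sketch of those five steps. Event, distance, refPoint and eventStream are illustrative names (not anything from the question or the answer), and the 100-event cutoff and 8 partitions are arbitrary:

import org.apache.spark.streaming.dstream.DStream

// Sketch only: Event, distance, refPoint and eventStream stand in for the
// question's real geotagged events and distance logic.
case class Event(lat: Double, lon: Double)

def distance(a: Event, b: Event): Double =
  math.hypot(a.lat - b.lat, a.lon - b.lon) // placeholder, not a true geographic distance

def closest100(eventStream: DStream[Event], refPoint: Event): DStream[(Double, Event)] =
  eventStream
    // Step 1: key each event by its distance to the reference point (on the DStream)
    .map(e => (distance(e, refPoint), e))
    // Steps 2 to 5 run on each batch's RDD inside transform
    .transform { rdd =>
      val sc = rdd.sparkContext
      // Step 2: sort by distance across several partitions
      val sorted = rdd.sortByKey(ascending = true, numPartitions = 8)
      // Step 3: first 100 per partition, collected and turned back into a small RDD
      val candidates = sc.parallelize(sorted.mapPartitions(_.take(100)).collect())
      // Step 4: sort again with a single partition
      val resorted = candidates.sortByKey(ascending = true, numPartitions = 1)
      // Step 5: the first 100 of the fully sorted data is the top 100
      sc.parallelize(resorted.take(100))
    }

The result of transform is still a DStream, so the further processing mentioned in the question can carry on from there.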

answered Sep 27 '22 22:09 by Fabien COMTE