
Spark structured streaming - join static dataset with streaming dataset

I'm using Spark structured streaming to process records read from Kafka. Here's what I'm trying to achieve:

(a) Each record is a Tuple2 of type (Timestamp, DeviceId).

(b) I've created a static Dataset[DeviceId] which contains the set of all valid device IDs (of type DeviceId) that are expected to be seen in the Kafka stream.

(c) I need to write a Spark structured streaming query that

 (i) Groups records by their timestamp into 5-minute windows
 (ii) For each window, gets the list of valid device IDs that were **not** seen in that window

For example, let's say the list of all valid device IDs is [A,B,C,D,E] and the Kafka records in a certain 5-minute window contain the device IDs [A,B,E]. Then, for that window, the list of unseen device IDs I'm looking for is [C,D].
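In plain Scala terms, the per-window result is just a set difference (a minimal illustration with hard-coded values, not the streaming query itself):

val validIds  = Set("A", "B", "C", "D", "E")  // all valid device IDs
val seenIds   = Set("A", "B", "E")            // device IDs seen in one 5-minute window
val unseenIds = validIds -- seenIds           // Set("C", "D")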

Question

  1. How can this query be written in Spark structured streaming? I tried using the except() and join() methods that Dataset exposes. However, they both threw a runtime exception complaining that neither of those operations is supported on a streaming Dataset.

Here's a snippet of my code:

// Static Dataset of valid device IDs, each paired with a dummy count
val validDeviceIds: Dataset[(DeviceId, Long)] = spark.createDataset(listOfAllDeviceIds.map(id => (id, 0L)))

case class KafkaRecord(timestamp: java.sql.Timestamp, deviceId: DeviceId)

// kafkaRecs is the data stream from Kafka - type is Dataset[KafkaRecord]
val deviceIdsSeen = kafkaRecs
     .withWatermark("timestamp", "5 minutes")
     .groupBy(window($"timestamp", "5 minutes", "5 minutes"), $"deviceId")
     .count()
     .map(row => (row.getLong(0), 1L))
     .as[(Long, Long)]

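// Right outer join: valid IDs never seen in the window should surface with a null count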
val unseenIds = deviceIdsSeen.join(validDeviceIds, Seq("_1"), "right_outer")
     .filter(row => row.isNullAt(1))
     .map(row => row.getLong(0))

The last statement throws the following exception:

Caused by: org.apache.spark.sql.AnalysisException: Right outer join with a streaming DataFrame/Dataset on the left is not supported;;

Thanks in advance.

asked Oct 02 '17 by jithinpt



2 Answers

The situation with join operations in Spark Structured Streaming is as follows: streaming DataFrames can be joined with static DataFrames, producing new streaming DataFrames. However, outer joins between a streaming and a static Dataset are only conditionally supported, and a right/left outer join with the streaming Dataset on the unsupported side is not supported by Structured Streaming at all. That is why you ran into the AnalysisException: it is thrown when you try to join a static Dataset with a streaming one in an unsupported direction. You can confirm this in the Spark source code, at the line where this exception is thrown for unsupported operations.

Here is an example of joining streaming DataFrames with a static DataFrame:

val streamingDf = spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "127.0.0.1:9092")
    .option("subscribe", "structured_topic")
    .load()

val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

import spark.implicits._
val staticDf = Seq((1507831462, 100)).toDF("Timestamp", "DeviceId")

// Inner join (stream-static inner joins are supported)
streamingDf.join(staticDf, "Timestamp")
lines.join(staticDf, "Timestamp")

// Left outer join with the streaming Dataset on the left (supported)
streamingDf.join(staticDf, Seq("Timestamp"), "left_outer")
lines.join(staticDf, Seq("Timestamp"), "left_outer")

As you can see, in addition to consuming data from Kafka, I also read data from a socket launched via netcat (nc -lk 9999); this significantly simplifies life when testing a streaming app. This approach works fine for me with both Kafka and a socket as the data source.

Hope that helps.

answered Sep 18 '22 by Artem


Outer joins with the streaming Dataset on the unsupported side are simply not supported:

  • Outer joins between a streaming and a static Datasets are conditionally supported.
    • Full outer join with a streaming Dataset is not supported
    • Left outer join with a streaming Dataset on the right is not supported
    • Right outer join with a streaming Dataset on the left is not supported

If the other Dataset is small, you can use a Map or similar structure, broadcast it, and reference it inside a UserDefinedFunction.

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.functions.udf

// Assuming DeviceId is a String: broadcast the valid-ID map to the executors
val map: Broadcast[Map[String, Long]] = spark.sparkContext.broadcast(listOfAllDeviceIds.map(id => (id, 0L)).toMap)
val lookup = udf((x: String) => map.value.get(x))

df.withColumn("foo", lookup($"_1"))
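A minimal usage sketch, reusing kafkaRecs from the question and lookup from above (names are illustrative): records for which the lookup yields null come from device IDs outside the broadcast set.

// Flag stream records whose deviceId is not among the broadcast valid IDs
val unknownDevices = kafkaRecs
    .withColumn("known", lookup($"deviceId"))
    .filter($"known".isNull)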
answered Sep 18 '22 by user8762155