
Structured Streaming and Splitting nested data into multiple datasets

I'm working with Spark Structured Streaming (2.2.1), using Kafka to receive data from sensors every 60 seconds. I'm having trouble wrapping my head around how to package this Kafka data so that I can process it correctly as it comes in.

I need to be able to do some calculations on the data as it comes in from Kafka.

My issue is unpacking the JSON data coming from Kafka into datasets I can work with.

Data

A simplified data point looks something like this:

{
  "id": 1,
  "timestamp": "timestamp",
  "pump": {
    "current": 1.0,
    "flow": 20.0,
    "torque": 5.0
  },
  "reactors": [
    {
      "id": 1,
      "status": 200
    },
    {
      "id": 2,
      "status": 300
    }
  ],
  "settings": {
    "pumpTimer": 20.0,
    "reactorStatusTimer": 200.0
  }
}

In order to be able to work with this in Spark, I've created some case class structures for each of these:

// First, general package
case class RawData(id: String, timestamp: String, pump: String, reactors: Array[String], settings: String)
// Each of the objects from the data
case class Pump(current: Float, flow: Float, torque: Float)
case class Reactor(id: Int, status: Int)
case class Settings(oos: Boolean, pumpTimer: Float, reactorStatusTimer: Float)

And generating the schema using:

val rawDataSchema = Encoders.product[RawData].schema

Raw data to Spark Schema

First, I parse the 'value' field from Kafka with my general schema:

val rawDataSet = df.select($"value" cast "string" as "json")
  .select(from_json($"json", rawDataSchema) as 'data)
  .select("data.*").as[RawData]

Using this rawDataSet, I can package each of the individual objects into datasets.
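
The pumpSchema and settingsSchema below are generated the same way as rawDataSchema; shown here for completeness:

// Schemas for the nested objects, generated the same way as rawDataSchema.
val pumpSchema = Encoders.product[Pump].schema
val settingsSchema = Encoders.product[Settings].schema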

val pump = rawDataSet.select(from_json($"pump", pumpSchema) as 'pumpData)
  .select("pumpData.*").as[Pump]

val settings = rawDataSet.select(from_json($"settings", settingsSchema) as 'settingsData)
  .select("settingsData.*").as[Settings]

And this gives me nice and clean datasets per JSON object.

Working with the data

Here is my issue: if I want to, for example, compare or calculate some values between the two datasets for Settings and Pump, JOIN does not work with Structured Streaming.

val joinedData = pump.join(settings)

Error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Inner join between two streaming DataFrames/Datasets is not supported;

Is my approach for this all wrong? Or are there any recommendations for alternative ways to handle this?

Thanks




1 Answer

I'll answer my own question with my now-working solution.

Instead of making case classes for each of the objects within the JSON, I could connect these together as one case class with nested objects, like so:

case class RawData(
  id: String, 
  timestamp: String, 
  pump: Pump, 
  reactors: Array[Reactor], 
  settings: Settings
)

case class Pump(current: Float, flow: Float, torque: Float)
case class Reactor(id: Int, status: Int)
case class Settings(oos: Boolean, pumpTimer: Float, reactorStatusTimer: Float)

To make this into a usable Dataset, I could simply call

import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.{explode, from_json}

val rawDataset = df.select($"value" cast "string" as "json")
  .select(from_json($"json", Encoders.product[RawData].schema) as 'data)
  .select("data.*").as[RawData]
  .withColumn("reactor", explode($"reactors")) // Handles the array of reactors, making one row per reactor.

After having processed the JSON and put it into my defined schema, I could select each specific sensor like this:

val tester = rawDataset.select($"pump.current", $"settings.pumpTimer")
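
And since pump and settings now live on the same row, comparing them no longer needs a join. A hypothetical check (the condition itself is just an illustration), written to a console sink for testing, could look like this:

// Hypothetical comparison between pump and settings on the same row; no join needed.
val alerts = rawDataset
  .filter($"pump.flow" > $"settings.pumpTimer")
  .select($"id", $"timestamp", $"pump.flow", $"settings.pumpTimer")

// Console sink, just for inspecting the stream while testing.
alerts.writeStream
  .format("console")
  .outputMode("append")
  .start()
  .awaitTermination()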

Thank you user6910411 for pointing me in the right direction
