 

How can I join a live Spark stream with all the data collected by another stream during its entire life cycle?

I have two Spark streams. The first receives data about products: their price from the supplier, the currency, their description, and the supplier id. These records are enriched with a category, inferred by analyzing the description, and with the price converted to dollars. They are then saved to a parquet dataset.

The second stream contains data about the auctioning of these products: the price at which they were sold and the sale date.

Given that a product can arrive in the first stream today and be sold a year later, how can I join the second stream with the full history contained in the parquet dataset of the first stream?

To be clear, the result should be the average daily earnings per price range ...

Claudio D'Alicandro asked Jan 17 '18


People also ask

Which Spark Streaming function is used to combine streams that are running in parallel?

If you want to receive multiple streams of data in parallel in your streaming application, you can create multiple input DStreams. This creates multiple receivers that simultaneously receive multiple data streams; the resulting DStreams can then be combined with union().
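A minimal sketch of that setup using the DStream API, assuming two local socket sources on ports 9998 and 9999 (both hypothetical): each receiver runs in parallel and union() merges them into one stream.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MultiReceiverExample {
  def main(args: Array[String]): Unit = {
    // Local example: at least one core per receiver plus one for processing
    val conf = new SparkConf().setAppName("multi-receiver").setMaster("local[4]")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // Two input DStreams, each backed by its own receiver running in parallel
    val stream1 = ssc.socketTextStream("localhost", 9998)
    val stream2 = ssc.socketTextStream("localhost", 9999)

    // union() combines the parallel streams into a single DStream
    val combined = stream1.union(stream2)
    combined.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```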

How does Spark read streaming data?

Use readStream.format("socket") on the SparkSession object to read data from a socket, providing the host and port options for the source you want to stream from.
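A short sketch of that, assuming a local source on port 9999 (for example one started with `nc -lk 9999`):

```scala
import org.apache.spark.sql.SparkSession

object SocketReadExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("socket-read")
      .master("local[*]")
      .getOrCreate()

    // Each line received on the socket becomes a row with a single "value" column
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Echo the stream to the console as new rows arrive
    val query = lines.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}
```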

Do Spark Streaming programs run continuously?

Users specify a streaming computation by writing a batch computation (using Spark's DataFrame/Dataset API), and the engine automatically incrementalizes this computation (runs it continuously).
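As a small illustration of that incrementalization (assuming a local socket source on port 9999): the aggregation below is written exactly as it would be for a static DataFrame, and the engine keeps the counts updated as new lines arrive.

```scala
import org.apache.spark.sql.SparkSession

object IncrementalWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("incremental-wordcount")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val lines = spark.readStream.format("socket")
      .option("host", "localhost").option("port", 9999).load()

    // Written exactly like a batch aggregation over a static Dataset ...
    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // ... but the engine incrementalizes it: the counts are updated continuously
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```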

How does new data arriving in a stream get represented in Spark Streaming?

In Spark Streaming, once the input data is received it is divided into batches for processing by the Spark engine. A DStream in Apache Spark is a continuous stream of data: Spark polls for data after a configurable batch interval and creates a new RDD for each batch.

What is a streaming join in Spark?

In Spark Structured Streaming, a streaming join is a streaming query that was described (built) using the high-level streaming operator Dataset.join. Joins between a streaming query and a batch query (stream-static joins) are stateless, so no state management is required.
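A minimal sketch of such a stream-static join; the parquet path, JSON directory, schema, and the product_id join key below are illustrative assumptions, not taken from the question above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object StreamStaticJoinExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stream-static-join")
      .master("local[*]")
      .getOrCreate()

    // Static side: a plain batch DataFrame, e.g. product reference data stored in parquet
    val products = spark.read.parquet("/data/products")          // hypothetical path

    // Streaming side: sale events arriving as JSON files in a directory (hypothetical schema)
    val salesSchema = new StructType()
      .add("product_id", StringType)
      .add("sale_price", DoubleType)
      .add("sale_date", DateType)

    val sales = spark.readStream
      .schema(salesSchema)
      .json("/data/incoming-sales")                              // hypothetical path

    // Stream-static join: stateless; the static side is joined against every micro-batch
    val joined = sales.join(products, Seq("product_id"))

    joined.writeStream
      .format("console")
      .outputMode("append")
      .start()
      .awaitTermination()
  }
}
```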

What is Spark Structured Streaming?

Structured Streaming is Spark's scalable, high-throughput, fault-tolerant stream processing engine built on the Spark SQL engine. It lets you express streaming computations with the same DataFrame/Dataset API used for batch data, and it supports operations such as stream-stream inner joins between two streaming DataFrames.
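A sketch of a stream-stream inner join, loosely following the ad-impressions/clicks pattern from the Spark documentation; the rate sources, column names, watermarks, and the one-minute time constraint are placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StreamStreamJoinExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stream-stream-join")
      .master("local[*]")
      .getOrCreate()

    // Two rate sources stand in for real streams; each produces (timestamp, value) rows
    val impressions = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
      .select(col("timestamp").as("impressionTime"), col("value").as("impressionAdId"))
      .withWatermark("impressionTime", "10 seconds")

    val clicks = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
      .select(col("timestamp").as("clickTime"), col("value").as("clickAdId"))
      .withWatermark("clickTime", "20 seconds")

    // Stream-stream inner join with an event-time constraint so buffered state can be dropped
    val joined = impressions.join(
      clicks,
      expr("""
        clickAdId = impressionAdId AND
        clickTime >= impressionTime AND
        clickTime <= impressionTime + interval 1 minute
      """)
    )

    joined.writeStream.format("console").outputMode("append").start().awaitTermination()
  }
}
```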

What is Apache Spark Streaming?

Spark Streaming is an engine for processing real-time data from various sources and writing the results out to external storage systems. It is a scalable, high-throughput, fault-tolerant stream processing system that supports both batch and streaming workloads, extending the core Spark API to process real-time data from sources such as Kafka and Flume.

How are stream-stream joins planned using IncrementalExecution in Spark?

At query planning time, IncrementalExecution uses the StreamingJoinStrategy execution planning strategy to plan stream-stream joins as StreamingSymmetricHashJoinExec physical operators. See "Stream-stream Joins" in the official Apache Spark documentation for Structured Streaming.


1 Answer

I found a possible solution with SnappyData, using its mutable DataFrames:

https://www.snappydata.io/blog/how-mutable-dataframes-improve-join-performance-spark-sql

The example described there is very similar to the one posted by Claudio D'Alicandro.
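For completeness, a plain Structured Streaming alternative that does not require SnappyData is a stream-static join: the auction stream is joined against the parquet history written by the first stream, then aggregated into the average daily earnings per price range. This is only a sketch; the paths, column names (product_id, price_usd, sale_price, sale_date) and the price-range bucketing are assumptions, and the static parquet side may not automatically pick up product files written after the query starts, which is the limitation the mutable-DataFrame approach works around.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object AvgDailyEarningsByPriceRange {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sales-vs-products")
      .master("local[*]")
      .getOrCreate()

    // Static side: the enriched product history written by the first stream (hypothetical path/columns)
    val products = spark.read.parquet("/data/products_enriched")
      // Bucket the supplier price in dollars into ranges of 100 (hypothetical bucketing rule)
      .withColumn("price_range", floor(col("price_usd") / 100) * 100)

    // Streaming side: the auction/sale events (hypothetical file source and schema for this sketch)
    val salesSchema = new StructType()
      .add("product_id", StringType)
      .add("sale_price", DoubleType)
      .add("sale_date", DateType)

    val sales = spark.readStream.schema(salesSchema).json("/data/incoming-sales")

    // Stream-static join: each micro-batch of sales is joined against the product history
    val joined = sales.join(products, Seq("product_id"))

    // Average daily earnings per price range (earnings = sale price minus supplier price)
    val result = joined
      .withColumn("earnings", col("sale_price") - col("price_usd"))
      .groupBy(col("sale_date"), col("price_range"))
      .agg(avg(col("earnings")).as("avg_daily_earnings"))

    result.writeStream
      .format("console")
      .outputMode("complete")
      .start()
      .awaitTermination()
  }
}
```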

giorrrgio answered Oct 19 '22