I have two Spark streams. The first carries product data: the price to the supplier, the currency, the description, and the supplier id. These records are enriched with a category, guessed from an analysis of the description, and with the price converted to dollars; they are then saved to a parquet dataset.
The second stream contains data about the auctioning of these products: the price at which they were sold and the date of the sale.
Given that a product can arrive in the first stream today and be sold a year later, how can I join the second stream with the whole history contained in the parquet dataset of the first stream?
To be clear, the result should be the average daily earnings per price range ...
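For reference, here is a minimal sketch of how the first stream is currently written to parquet; the socket source, the schema, and the enrichment logic are simplified stand-ins for the real pipeline:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("ProductsStream").getOrCreate()

val productSchema = new StructType()
  .add("productId", StringType)
  .add("supplierId", StringType)
  .add("price", DoubleType)
  .add("currency", StringType)
  .add("description", StringType)

val products = spark.readStream
  .format("socket")                       // placeholder source for the sketch
  .option("host", "localhost")
  .option("port", "9999")
  .load()
  .select(from_json(col("value"), productSchema).as("p"))
  .select("p.*")

// Enrichment (the real logic is more involved; this only sketches the shape).
val enriched = products
  .withColumn("priceUsd",
    when(col("currency") === "USD", col("price")).otherwise(col("price") * lit(1.1)))
  .withColumn("category",
    when(col("description").contains("phone"), "electronics").otherwise("other"))

// Append the enriched products to the parquet history.
enriched.writeStream
  .format("parquet")
  .option("path", "/data/products_history")
  .option("checkpointLocation", "/checkpoints/products_history")
  .start()
```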
Note that if you want to receive multiple streams of data in parallel in your streaming application, you can create multiple input DStreams (discussed further in the Performance Tuning section of the Spark Streaming guide). This creates multiple receivers that receive multiple data streams simultaneously.
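For example, a minimal DStream sketch that creates two receivers and unions them (hosts, ports, and the batch interval are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("MultiReceiverExample")
val ssc  = new StreamingContext(conf, Seconds(10))      // 10-second batch interval

// Two receivers pulling from two sources in parallel.
val stream1 = ssc.socketTextStream("host1", 9999)
val stream2 = ssc.socketTextStream("host2", 9999)

// Union them into a single DStream for downstream processing.
val unified = stream1.union(stream2)
unified.print()

ssc.start()
ssc.awaitTermination()
```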
Use readStream.format("socket") on the SparkSession object to read data from a socket, and provide the host and port options for the endpoint you want to stream data from.
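A minimal sketch, assuming a local socket endpoint:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("SocketSource").getOrCreate()

// Streaming DataFrame with a single string column named `value`.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")   // placeholder host
  .option("port", "9999")        // placeholder port
  .load()
```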
Users specify a streaming computation by writing a batch computation (using Spark's DataFrame/Dataset API), and the engine automatically incrementalizes this computation (runs it continuously).
By contrast, the older Spark Streaming (DStream) API divides the input data it receives into batches for processing by the Spark engine. A DStream in Apache Spark is a continuous stream of data: Spark polls the source after a configurable batch interval and creates a new RDD for each batch.
In Spark Structured Streaming, a streaming join is a streaming query built with the high-level operator Dataset.join. Joins between a streaming query and a batch query (stream-static joins) are stateless, so no state management is required.
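Applied to this question, that means reading the parquet history as a static DataFrame and joining the streaming sale events against it. A minimal sketch; the socket source, the schemas, the paths, the 100-dollar price buckets, and column names such as productId, priceUsd, salePrice, and saleDate are all assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("SalesStreamStaticJoin").getOrCreate()

// Static side: the whole product history accumulated so far in parquet.
val productHistory = spark.read.parquet("/data/products_history")

// Streaming side: the sale events (source and schema are placeholders).
val saleSchema = new StructType()
  .add("productId", StringType)
  .add("salePrice", DoubleType)
  .add("saleDate", DateType)

val sales = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9998")
  .load()
  .select(from_json(col("value"), saleSchema).as("s"))
  .select("s.*")

// Stream-static join: every incoming sale is matched against the parquet history.
val joined = sales.join(productHistory, Seq("productId"))

// Average daily earnings per price range (100-dollar buckets assumed).
val result = joined
  .withColumn("priceRange", floor(col("priceUsd") / 100) * 100)
  .withColumn("earnings", col("salePrice") - col("priceUsd"))
  .groupBy(col("saleDate"), col("priceRange"))
  .agg(avg(col("earnings")).as("avgDailyEarnings"))

result.writeStream
  .outputMode("complete")       // the aggregation is recomputed as new sales arrive
  .format("console")
  .option("checkpointLocation", "/checkpoints/sales_join")
  .start()
  .awaitTermination()
```

One caveat: depending on the Spark version, the file listing for the static parquet side may be cached when the query starts, so product files written after the join query is launched may not be visible to it; if that matters, consider using foreachBatch and re-reading the parquet path inside each micro-batch.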
Spark Streaming is a scalable, high-throughput, fault-tolerant stream-processing engine that supports both batch and streaming workloads; it extends the core Spark API to process real-time data from sources such as Kafka and Flume and to write results to external storage systems. If what you need instead is to join two live streams (a stream-stream join), Structured Streaming supports that as well.
At query planning time, IncrementalExecution uses the StreamingJoinStrategy execution planning strategy to plan stream-stream joins as StreamingSymmetricHashJoinExec physical operators. See "Stream-stream Joins" in the official Apache Spark Structured Streaming documentation for details.
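For completeness, a minimal stream-stream inner-join sketch with watermarks, modelled on the ad-impressions/clicks example from the Spark documentation; the rate sources and the column names are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr}

val spark = SparkSession.builder.appName("StreamStreamJoin").getOrCreate()

// Two rate sources stand in for real streams (e.g. Kafka topics).
val impressions = spark.readStream.format("rate").option("rowsPerSecond", "5").load()
  .select(col("timestamp").as("impressionTime"), col("value").as("impressionAdId"))
  .withWatermark("impressionTime", "2 hours")

val clicks = spark.readStream.format("rate").option("rowsPerSecond", "5").load()
  .select(col("timestamp").as("clickTime"), col("value").as("clickAdId"))
  .withWatermark("clickTime", "3 hours")

// Inner join with an event-time constraint so the engine can bound the state it keeps.
val joined = impressions.join(
  clicks,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
  """))

joined.writeStream
  .format("console")
  .option("checkpointLocation", "/checkpoints/stream_stream_join")
  .start()
  .awaitTermination()
```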
I found a possible solution with SnappyData, using its mutable DataFrame:
https://www.snappydata.io/blog/how-mutable-dataframes-improve-join-performance-spark-sql
The example in that post is very similar to the one described by claudio-dalicandro.