I am creating an application that receives streaming data, which goes into Kafka and then to Spark. Spark consumes the data, applies some logic, and then saves the processed data into Hive. The velocity of the data is very high: I am getting 50K records per minute. There is a 1-minute window in Spark Streaming in which it processes the data and saves it into Hive.
My question: from a production perspective, is this architecture fine? If yes, how should I save the streaming data into Hive? What I am doing is creating a DataFrame of the 1-minute window data and saving it into Hive by using
results.write.mode(org.apache.spark.sql.SaveMode.Append).insertInto("stocks")
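For context, here is roughly how that write sits inside the job. This is only a sketch using Structured Streaming's foreachBatch (Spark 2.4+); the broker address, topic name, checkpoint path, and parsing step are placeholders:

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder()
  .appName("stocks-ingest")
  .enableHiveSupport()
  .getOrCreate()

// read the raw stream from Kafka (broker and topic names are placeholders)
val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "stocks")
  .load()

// apply the business logic here; this just casts the payload to a string
val results = raw.selectExpr("CAST(value AS STRING) AS value")

// append each micro-batch into the Hive table "stocks";
// the table must already exist with a matching schema
val writeBatch: (DataFrame, Long) => Unit = (batch, _) =>
  batch.write.mode(SaveMode.Append).insertInto("stocks")

results.writeStream
  .trigger(Trigger.ProcessingTime("1 minute"))
  .option("checkpointLocation", "/tmp/checkpoints/stocks") // placeholder path
  .foreachBatch(writeBatch)
  .start()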
I have not created the pipeline yet. Is this fine, or do I have to modify the architecture?
Thanks
Spark Structured Streaming provides the same structured APIs (DataFrames and Datasets) as Spark so that you don't need to develop on or maintain two different technology stacks for batch and streaming. In addition, unified APIs make it easy to migrate your existing batch Spark jobs to streaming jobs.
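As a small illustration of that unified API (the input path and column name here are invented), the same aggregation can run as a batch job and as a streaming job:

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("unified-api").getOrCreate()
import spark.implicits._

// the same transformation, written once
def countBySymbol(df: DataFrame): DataFrame = df.groupBy($"symbol").count()

// batch: static read of a directory of JSON files (path is a placeholder)
val batchDf = spark.read.json("/data/stocks")
countBySymbol(batchDf).show()

// streaming: identical logic over the same directory as a file stream
val streamDf = spark.readStream.schema(batchDf.schema).json("/data/stocks")
countBySymbol(streamDf).writeStream
  .outputMode("complete")
  .format("console")
  .start()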
Spark Streaming is an extension of the core Spark API that allows data engineers and data scientists to process real-time data from various sources including (but not limited to) Kafka, Flume, and Amazon Kinesis. This processed data can be pushed out to file systems, databases, and live dashboards.
Use readStream.format("socket") from the SparkSession object to read data from a socket, and provide the host and port options for the source you want to stream from.
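For example (host and port are just placeholders; you can test locally with nc -lk 9999):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("socket-source").getOrCreate()

// read lines from a TCP socket; host and port values are examples
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// print the incoming lines to the console
lines.writeStream
  .format("console")
  .start()
  .awaitTermination()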
I would give it a try!
BUT kafka -> spark -> hive is not the optimal pipeline for your use case.
Suggestions:
Option 1: use Kafka just as a buffer queue and design your pipeline like kafka -> hdfs (e.g. with Spark or Flume) -> batch Spark -> Hive/Impala table. The drawback: option 1 has no "realtime" analysis; freshness depends on how often you run the batch Spark job.
Option 2: keep recent data in HBase and historical data in Hive/Impala. This is the option I would recommend: store e.g. the last 30 days in HBase and all older data in Hive/Impala. With a view (sketched below) you will be able to join new and old data for realtime analysis. Kudu makes the architecture even easier.
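A rough sketch of that view idea in Spark SQL; the table names stocks_recent (HBase-backed, e.g. via the Hive HBase storage handler) and stocks_archive are made up, and both tables must share a schema:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("stocks-view")
  .enableHiveSupport()
  .getOrCreate()

// union the HBase-backed recent table with the Hive archive table;
// all table and column names here are assumptions
spark.sql("""
  CREATE OR REPLACE VIEW stocks_all AS
  SELECT * FROM stocks_recent
  UNION ALL
  SELECT * FROM stocks_archive
""")

// queries against the view see new and old data together
spark.sql("SELECT symbol, MAX(price) AS max_price FROM stocks_all GROUP BY symbol").show()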
Saving data into Hive tables can be tricky if you want to partition it and query it via HiveQL.
But basically it would work like the following:
xml.write.format("parquet").mode("append").saveAsTable("test_ereignis_archiv")
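If you do want partitions, a variant of the same write could look like this; the partition column "dt" is an assumption about your schema and must exist in the DataFrame:

// same write, but partitioned; "dt" is an assumed partition column
xml.write
  .format("parquet")
  .mode("append")
  .partitionBy("dt")
  .saveAsTable("test_ereignis_archiv")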
BR