 

Storing streaming data in Hive using Spark

I am creating an application that receives streaming data, which goes into Kafka and then to Spark. Spark consumes the data, applies some logic, and then saves the processed data into Hive. The velocity of the data is very high: I am getting about 50K records per minute. Spark Streaming uses a 1-minute window in which it processes the data and saves it into Hive.

My question is: from a production perspective, is this architecture fine? If yes, how should I save the streaming data into Hive? What I am doing is creating a DataFrame of the 1-minute window data and saving it into Hive by using

results.write.mode(org.apache.spark.sql.SaveMode.Append).insertInto("stocks")
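
For reference, a rough sketch of the job I have in mind looks like this (broker address, topic name and the parsing step are just placeholders; the DataFrame schema has to match the existing stocks table):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val conf = new SparkConf().setAppName("KafkaToHive")
val ssc = new StreamingContext(conf, Seconds(60))     // 1-minute micro-batches

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",              // placeholder broker
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "stocks-consumer",
  "auto.offset.reset" -> "latest"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("stocks"), kafkaParams)
)

stream.foreachRDD { rdd =>
  val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
  import spark.implicits._
  // parsing of the raw Kafka values into the table columns is omitted here
  val results = rdd.map(_.value).toDF("raw")
  results.write.mode(SaveMode.Append).insertInto("stocks")
}

ssc.start()
ssc.awaitTermination()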

I have not created the pipeline yet. Is this fine, or do I have to modify the architecture?

Thanks

lucy asked Sep 06 '17



1 Answer

I would give it a try!

BUT Kafka -> Spark -> Hive is not the optimal pipeline for your use case.

  1. Hive is normally based on HDFS, which is not designed for a high number of small inserts/updates/selects. So your plan can end up with the following problems:
    • many small files, which results in bad performance
    • your 1-minute window becomes too small because the writes take too long

Suggestion:

Option 1:

  • use Kafka just as a buffer queue and design your pipeline like Kafka -> HDFS (e.g. with Spark or Flume) -> batch Spark to a Hive/Impala table

Option 2:

  • Kafka -> Flume/Spark to HBase/Kudu -> batch Spark to Hive/Impala

Option 1 has no "realtime" analysis option; it depends on how often you run the batch Spark job.
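
A rough sketch of the batch step in option 1 could look like this (the landing directory and the coalesce factor are just assumptions; the streaming/Flume job is assumed to land raw Parquet files on HDFS):

import org.apache.spark.sql.{SaveMode, SparkSession}

// periodic batch job: read the raw files the streaming/Flume job dropped on HDFS,
// compact them and append them to the Hive table
val spark = SparkSession.builder()
  .appName("BatchLoadStocks")
  .enableHiveSupport()
  .getOrCreate()

val raw = spark.read.parquet("hdfs:///data/stocks/raw")   // hypothetical landing directory

raw
  .coalesce(8)                        // fewer, larger files instead of many small ones
  .write
  .mode(SaveMode.Append)
  .insertInto("stocks")               // existing Hive table from the question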

Option 2 is a good choice that I would recommend: store something like 30 days of data in HBase and all older data in Hive/Impala. With a view you will be able to join new and old data for realtime analysis. Kudu makes the architecture even easier.
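
For the view, something like this could combine new and old data (assuming a Hive-enabled SparkSession named spark, an HBase/Kudu-backed table stocks_recent and a Hive archive table stocks_archive; all names are assumptions):

// union the "hot" last ~30 days table with the Hive archive so queries see both
spark.sql("""
  CREATE OR REPLACE VIEW stocks_all AS
  SELECT * FROM stocks_recent    -- HBase/Kudu-backed table with the recent data
  UNION ALL
  SELECT * FROM stocks_archive   -- Hive/Impala table with the older data
""")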

Saving data into Hive tables can be tricky if you want to partition it and use it via HiveQL.

But basically it would work like the following:

xml.write.format("parquet").mode("append").saveAsTable("test_ereignis_archiv")
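
If the table should be partitioned, e.g. by an event date column, a sketch could look like this (the partition column name is just an assumption and has to exist in the DataFrame):

xml.write
  .format("parquet")
  .mode("append")
  .partitionBy("event_date")            // hypothetical partition column
  .saveAsTable("test_ereignis_archiv")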

BR

kf2 answered Oct 08 '22