Exception: 'writeStream' can be called only on streaming Dataset/DataFrame

I am trying to create a test for the Spark Structured Streaming writeStream function, as shown below:

val spark = SparkSession.builder().master("local").appName("spark session").getOrCreate()

val lakeDF = spark.createDF(List(("hi")), List(("word", StringType, true)))

lakeDF.writeStream
  .trigger(Trigger.Once)
  .format("parquet")
  .option("checkpointLocation", checkpointPath)
  .start(dataPath)

But I am getting the following exception: org.apache.spark.sql.AnalysisException: 'writeStream' can be called only on streaming Dataset/DataFrame;

I am very new to Spark streaming. Please let me know how I can create a streaming DataFrame, or convert the regular DataFrame above into a streaming DataFrame, for my test suite.

Asked Jul 18 '18 by Dhruvajyoti Chatterjee

People also ask

Which of the following are supported in Spark structured streaming?

Spark Streaming is a processing engine that processes real-time data from sources and outputs it to external storage systems. It has three major components: input sources, the streaming engine, and the sink. Input sources such as Kafka, Flume, and HDFS/S3 generate the data.

Which method is used to count the streaming words and aggregate the previous data?

Streaming – Complete Output Mode. This mode is used only when you have aggregated streaming data. One example is counting the words on streaming data, aggregating the counts with previous data, and outputting the results to the sink.
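The running word count described above can be sketched as follows. This is a minimal illustration, assuming a text stream is available on a socket (for example via `nc -lk 9999`); the host, port, and app name are placeholders, not values from the question.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("wordcount").getOrCreate()
import spark.implicits._

// Read lines of text from a socket source (test/demo source, not for production).
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Split each line into words and keep a running count across all micro-batches.
val counts = lines.as[String]
  .flatMap(_.split("\\s+"))
  .groupBy("value")
  .count()

// Complete mode re-emits the entire aggregated result table on every trigger.
counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()
```

Because the aggregation is unbounded and the whole result table is rewritten each trigger, complete mode only makes sense for aggregated queries like this one.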

What is the difference between Spark streaming and structured streaming?

Spark Streaming receives real-time data and divides it into smaller batches for the execution engine. In contrast, Structured Streaming is built on the Spark SQL API for data stream processing. In the end, both APIs are optimized by the Catalyst optimizer and translated into RDDs for execution under the hood.

In which of the following modes output can be configured for structured streaming programming model?

The output can be defined in different modes: Complete Mode – the entire updated Result Table is written to external storage; Append Mode – only the new rows appended to the Result Table since the last trigger are written; Update Mode – only the rows updated in the Result Table since the last trigger are written.


1 Answer

In Spark Structured Streaming, streaming DataFrames/Datasets are created from a stream using readStream on a SparkSession. If a DataFrame/Dataset was not created from a stream, you are not allowed to store it using writeStream — which is exactly what the AnalysisException is telling you.

So create the DataFrame/Dataset using readStream, and store it using writeStream:

val kafkaStream = sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka-broker-hostname:port")
  .option("subscribe", "topicname")
  .load()
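For a test suite like the one in the question, a Kafka broker is usually overkill. One common approach is MemoryStream, Spark's in-memory streaming source intended for tests. The sketch below assumes the `checkpointPath` and `dataPath` values from the question are defined elsewhere in the test:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().master("local").appName("spark session").getOrCreate()
import spark.implicits._
// MemoryStream needs an implicit SQLContext in scope.
implicit val sqlCtx = spark.sqlContext

// An in-memory source we can push test rows into.
val inputStream = MemoryStream[String]
inputStream.addData("hi")

// toDF() yields a *streaming* DataFrame, so writeStream is now allowed.
val lakeDF = inputStream.toDF().toDF("word")

lakeDF.writeStream
  .trigger(Trigger.Once)
  .format("parquet")
  .option("checkpointLocation", checkpointPath)
  .start(dataPath)
  .awaitTermination()
```

With Trigger.Once the query processes whatever data has been added and then stops, which makes assertions on the written parquet output straightforward.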
Answered Sep 26 '22 by Naga