Trying to test Spark Structured Streaming ...and failing... how can I test it properly?
I followed the general Spark testing question from here, and my closest attempt [1] looks something like this:
import simpleSparkTest.SparkSessionTestWrapper
import org.scalatest.FunSpec
import org.apache.spark.sql.types.{StringType, IntegerType, DoubleType, StructType, DateType}
import org.apache.spark.sql.streaming.OutputMode
class StructuredStreamingSpec extends FunSpec with SparkSessionTestWrapper {

  describe("Structured Streaming") {

    it("Read file from system") {

      val schema = new StructType()
        .add("station_id", IntegerType)
        .add("name", StringType)
        .add("lat", DoubleType)
        .add("long", DoubleType)
        .add("dockcount", IntegerType)
        .add("landmark", StringType)
        .add("installation", DateType)

      val sourceDF = spark.readStream
        .option("header", "true")
        .schema(schema)
        .csv("/Spark-The-Definitive-Guide/data/bike-data/201508_station_data.csv")
        .coalesce(1)

      val countSource = sourceDF.count()

      val query = sourceDF.writeStream
        .format("memory")
        .queryName("Output")
        .outputMode(OutputMode.Append())
        .start()
        .processAllAvailable()

      assert(countSource === 70)
    }
  }
}
Sadly it always fails with org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start()
I also found this issue at the spark-testing-base repo and wonder if it is even possible to test Spark Structured Streaming?
I want to have integration tests, and maybe even use Kafka on top for testing checkpointing or specific corrupt-data scenarios. Can someone help me out?
Last but not least, I figured the version may also be a constraint: I currently develop against 2.1.0, which I need because of the Azure HDInsight deployment options. Self-hosting is an option if the version is the blocker.
Spark Structured Streaming allows near real-time computation over streaming data on top of the Spark SQL engine, producing aggregates or other output according to the logic you define. The streaming data can be read from files, sockets, or sources such as Kafka.
This leads to a stream processing model that is very similar to the batch processing model: you express your streaming computation as a standard batch-like query, as if it were running on a static table, and Spark runs it as an incremental query on the unbounded input table.
Since Spark 2.0, DataFrames and Datasets can represent static, bounded data as well as streaming, unbounded data. Just as with static Datasets/DataFrames, you use the common entry point SparkSession to create streaming DataFrames/Datasets from streaming sources and apply the same operations to them as you would to static DataFrames/Datasets.
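To make the model concrete, here is a minimal sketch adapted from the standard word-count example in the Spark programming guide (the app name, host and port are placeholders): a streaming DataFrame created from a socket source is transformed with the same operations you would use on a static one, and the incremental query only starts running once writeStream.start() is called.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()
import spark.implicits._

// Unbounded input table: one row per line arriving on the socket
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Standard batch-like query over the streaming DataFrame
val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// Nothing executes until the query is started
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()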
Did you solve this?
You are calling count() on a streaming DataFrame before starting the execution with start(). If you want a count, how about doing this instead?
sourceDF.writeStream
  .format("memory")
  .queryName("Output")
  .outputMode(OutputMode.Append())
  .start()
  .processAllAvailable()

// requires import org.apache.spark.sql.Row
val results: Array[Row] = spark.sql("select * from Output").collect()
assert(results.length === 70)
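If you want a self-contained test that does not depend on files on disk, one option is to feed the query from a MemoryStream. This is a sketch, assuming Spark 2.x and a SparkSession named spark provided by your test wrapper; the query name "TestOutput" and the test data are made up. MemoryStream lives in the internal org.apache.spark.sql.execution.streaming package, but it is what Spark's own structured-streaming tests use.

import org.apache.spark.sql.Row
import org.apache.spark.sql.execution.streaming.MemoryStream

import spark.implicits._                 // Encoder for the element type
implicit val sqlCtx = spark.sqlContext   // MemoryStream needs an implicit SQLContext

// In-memory streaming source that the test controls directly
val input = MemoryStream[String]
input.addData("station-1", "station-2", "station-3")

val query = input.toDF()
  .writeStream
  .format("memory")
  .queryName("TestOutput")
  .outputMode("append")
  .start()

// Block until everything added so far has been processed
query.processAllAvailable()

val results: Array[Row] = spark.sql("select * from TestOutput").collect()
assert(results.length === 3)

query.stop()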