I'm trying to test a part of my program which performs transformations on dataframes I want to test several different variations of these dataframe which rules out the option of reading a specific DF from a file And so my questions are: <ol> <li>Is there any good tutorial on how to perform unit testing with Spark and dataframes, especially regarding the dataframes creation?</li> <li>How can I create these different several lines dataframes without a lot of boilerplate and without reading these from a file?</li> <li>Are there any utility classes for checking for specific values inside a dataframe?</li> </ol> I obviously googled that before but could not find anything which was very useful. Among the more useful links I found were: <ul> <li>Running a basic unit test with a dataframe</li> <li>Custom made assertions with DF</li> </ul> It would be great if examples/tutorials are in Scala but I'll take whatever language you've got Thanks in advance

This link shows how we can programmatically create a data frame with schema. You can keep the data in separate traits and mix it in with your tests. For instance, <pre class="prettyprint"><code>// This example assumes CSV data. But same approach should work for other formats as well. trait TestData { val data1 = List( "this,is,valid,data", "this,is,in-valid,data", ) val data2 = ... } </code></pre> Then with ScalaTest, we can do something like this. <pre class="prettyprint"><code>class MyDFTest extends FlatSpec with Matchers { "method" should "perform this" in new TestData { // You can access your test data here. Use it to create the DataFrame. // Your test here. } } </code></pre> To create the DataFrame, you can have few util methods like below. <pre class="prettyprint"><code> def schema(types: Array[String], cols: Array[String]) = { val datatypes = types.map { case "String" => StringType case "Long" => LongType case "Double" => DoubleType // Add more types here based on your data. case _ => StringType } StructType(cols.indices.map(x => StructField(cols(x), datatypes(x))).toArray) } def df(data: List[String], types: Array[String], cols: Array[String]) = { val rdd = sc.parallelize(data) val parser = new CSVParser(',') val split = rdd.map(line => parser.parseLine(line)) val rdd = split.map(arr => Row(arr(0), arr(1), arr(2), arr(3))) sqlContext.createDataFrame(rdd, schema(types, cols)) } </code></pre> I am not aware of any utility classes for checking specific values in a DataFrame. But I think it should be simple to write one using the DataFrame APIs.

Unit testing with Spark dataframes

Tags:

unit-testing

scala

apache-spark

apache-spark-sql

spark-dataframe

I'm trying to test a part of my program which performs transformations on dataframes I want to test several different variations of these dataframe which rules out the option of reading a specific DF from a file

And so my questions are:

Is there any good tutorial on how to perform unit testing with Spark and dataframes, especially regarding the dataframes creation?
How can I create these different several lines dataframes without a lot of boilerplate and without reading these from a file?
Are there any utility classes for checking for specific values inside a dataframe?

I obviously googled that before but could not find anything which was very useful. Among the more useful links I found were:

Running a basic unit test with a dataframe
Custom made assertions with DF

It would be great if examples/tutorials are in Scala but I'll take whatever language you've got

Thanks in advance

433

asked Mar 17 '16 12:03

Gideon

1 Answers

This link shows how we can programmatically create a data frame with schema. You can keep the data in separate traits and mix it in with your tests. For instance,

// This example assumes CSV data. But same approach should work for other formats as well.

trait TestData {
  val data1 = List(
    "this,is,valid,data",
    "this,is,in-valid,data",
  )
  val data2 = ...  
}

Then with ScalaTest, we can do something like this.

class MyDFTest extends FlatSpec with Matchers {

  "method" should "perform this" in new TestData {
     // You can access your test data here. Use it to create the DataFrame.
     // Your test here.
  }
}

To create the DataFrame, you can have few util methods like below.

  def schema(types: Array[String], cols: Array[String]) = {
    val datatypes = types.map {
      case "String" => StringType
      case "Long" => LongType
      case "Double" => DoubleType
      // Add more types here based on your data.
      case _ => StringType
    }
    StructType(cols.indices.map(x => StructField(cols(x), datatypes(x))).toArray)
  }

  def df(data: List[String], types: Array[String], cols: Array[String]) = {
    val rdd = sc.parallelize(data)
    val parser = new CSVParser(',')
    val split = rdd.map(line => parser.parseLine(line))
    val rdd = split.map(arr => Row(arr(0), arr(1), arr(2), arr(3)))
    sqlContext.createDataFrame(rdd, schema(types, cols))
  }

I am not aware of any utility classes for checking specific values in a DataFrame. But I think it should be simple to write one using the DataFrame APIs.

138

answered Sep 21 '22 14:09

Jegan

Related questions
                            
                                Intellij Scala multiple import style formatting settings
                            
                                How to determine when Play! 2 must recompile all files?
                            
                                Hamcrest and ScalaTest
                            
                                scala Duration: "This class is not meant as a general purpose representation of time, it is optimized for the needs of scala.concurrent."
                            
                                Understanding parallelism in Spark and Scala
                            
                                No ClassTag available for MyClass.this.T for an abstract type
                            
                                How to structure a RESTful API with Spray.io?
                            
                                Why does AudioSystem.getMixerInfo() return different results under sbt vs Scala?
                            
                                |+| is a semigroup, why it needs a monoid implicit resolution
                            
                                SBT build, run main class from subproject on Compile and run
                            
                                head :: tail pattern matching for strings
                            
                                Is there any trick to use macros in the same file they are defined?
                            
                                Extracting Raw JSON as String inside a Spray POST route
                            
                                Scala Typeclasses with generics
                            
                                In Play 2 how to check if a JsValue variable is NULL?
                            
                                Is Spark zipWithIndex safe with parallel implementation?
                            
                                What is the most concise way to increment a variable of type Short in Scala?
                            
                                spark submit java.lang.ClassNotFoundException
                            
                                Akka Http: Exceeded configured max-open-requests value of [32]
                            
                                Turn a side-effecting function returning Option into an Iterator

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With