How do I apply a schema with nullable = false when reading JSON?

Tags:

apache-spark

I'm trying to write some test cases using JSON files for DataFrames (whereas production would use Parquet). I'm using the spark-testing-base framework and I'm running into a snag when asserting that DataFrames equal each other, due to schema mismatches: the schema read from JSON always comes back with nullable = true.

I'd like to be able to apply a schema with nullable = false to the JSON read.

I've written a small test case:

import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
import org.scalatest.FunSuite

class TestJSON extends FunSuite with DataFrameSuiteBase {

  val expectedSchema = StructType(
    List(StructField("a", IntegerType, nullable = false),
         StructField("b", IntegerType, nullable = true))
  )
  test("testJSON") {
    val readJson =
      spark.read.schema(expectedSchema).json("src/test/resources/test.json")

    assert(readJson.schema == expectedSchema)

  }
}

And I have a small test.json file of:

{"a": 1, "b": 2}
{"a": 1}

This returns an assertion failure of

StructType(StructField(a,IntegerType,true), StructField(b,IntegerType,true)) did not equal StructType(StructField(a,IntegerType,false), StructField(b,IntegerType,true))
ScalaTestFailureLocation: TestJSON$$anonfun$1 at (TestJSON.scala:15)
Expected: StructType(StructField(a,IntegerType,false), StructField(b,IntegerType,true))
Actual:   StructType(StructField(a,IntegerType,true), StructField(b,IntegerType,true))

Am I applying the schema the correct way? I'm using Spark 2.2 and Scala 2.11.8.

asked Nov 22 '17 by Nurdin Premji



2 Answers

There is a workaround: rather than reading the JSON directly from the file, read it into an RDD of strings first and then apply the schema when parsing that RDD. Below is the code:

val expectedSchema = StructType(
  List(StructField("a", IntegerType, nullable = false),
       StructField("b", IntegerType, nullable = true))
)

test("testJSON") {
  val jsonRdd = spark.sparkContext.textFile("src/test/resources/test.json")
  // val readJson = spark.read.schema(expectedSchema).json("src/test/resources/test.json")
  val readJson = spark.read.schema(expectedSchema).json(jsonRdd)
  readJson.printSchema()
  assert(readJson.schema == expectedSchema)
}

The test case passes and the printSchema output is:

root
 |-- a: integer (nullable = false)
 |-- b: integer (nullable = true)

There is an Apache Spark JIRA for this issue, https://issues.apache.org/jira/browse/SPARK-10848, which was closed as not a problem with the comment:

This should be resolved in the latest file format refactoring in Spark 2.0. Please reopen it if you still hit the problem. Thanks!

If you are still getting the error you can reopen the JIRA. I tested on Spark 2.1.0 and still see the same issue.
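
Not part of the original answer, but a minimal sketch of another possible workaround if you would rather keep the file-based read: rebuild the DataFrame from the rows of the schema-based read using SparkSession.createDataFrame(RDD[Row], StructType), which keeps whatever nullability flags you pass in. Note that Spark does not verify the data against nullable = false, so this assumes the non-nullable columns really contain no nulls.

import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val expectedSchema = StructType(
  List(StructField("a", IntegerType, nullable = false),
       StructField("b", IntegerType, nullable = true))
)

// File-based read: values are parsed with the right types, but the JSON
// source flips every field back to nullable = true.
val fileRead = spark.read.schema(expectedSchema).json("src/test/resources/test.json")

// Re-wrap the same rows under the schema we actually want; the resulting
// DataFrame reports nullable = false for "a".
val readJson = spark.createDataFrame(fileRead.rdd, expectedSchema)

readJson.printSchema()
assert(readJson.schema == expectedSchema)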

answered Oct 12 '22 by Amit Kumar


The workaround above ensures the schema is correct, but null values are replaced with default values. In my case, when an Int does not exist in the JSON string it is set to 0.
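
To make that concrete, here is a small sketch (not from the answer) that reproduces the scenario: a hypothetical record with the non-nullable field "a" missing, read through the same RDD-based path as the accepted answer. Whether the missing value shows up as 0 depends on your Spark version, so treat the 0 as reported behaviour rather than a guarantee.

import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical record missing the non-nullable field "a" (not part of the
// question's test.json), parallelized so the example is self-contained.
val jsonRdd = spark.sparkContext.parallelize(Seq("""{"b": 2}"""))

val schema = StructType(
  List(StructField("a", IntegerType, nullable = false),
       StructField("b", IntegerType, nullable = true))
)

val df = spark.read.schema(schema).json(jsonRdd)
df.show()
// Per the observation above, "a" may come back as 0 here instead of null,
// because the column is declared non-nullable but the value is absent.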

answered Oct 12 '22 by EY.Mohamed