How do I apply a schema with nullable = false when reading JSON?

Tags:

apache-spark

I'm trying to write some test cases using JSON files for DataFrames (whereas production would use Parquet). I'm using the spark-testing-base framework and I'm running into a snag when asserting that DataFrames equal each other, due to schema mismatches: the schema read from JSON always comes back with nullable = true.

I'd like to be able to apply a schema with nullable = false to the JSON read.

I've written a small test case:

import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
import org.scalatest.FunSuite

class TestJSON extends FunSuite with DataFrameSuiteBase {

  val expectedSchema = StructType(
    List(StructField("a", IntegerType, nullable = false),
         StructField("b", IntegerType, nullable = true))
  )
  test("testJSON") {
    val readJson =
      spark.read.schema(expectedSchema).json("src/test/resources/test.json")

    assert(readJson.schema == expectedSchema)

  }
}

And I have a small test.json file of:

{"a": 1, "b": 2}
{"a": 1}

This returns an assertion failure of

StructType(StructField(a,IntegerType,true), StructField(b,IntegerType,true)) did not equal StructType(StructField(a,IntegerType,false), StructField(b,IntegerType,true))
ScalaTestFailureLocation: TestJSON$$anonfun$1 at (TestJSON.scala:15)
Expected: StructType(StructField(a,IntegerType,false), StructField(b,IntegerType,true))
Actual:   StructType(StructField(a,IntegerType,true), StructField(b,IntegerType,true))

Am I applying the schema the correct way? I'm using Spark 2.2 and Scala 2.11.8.

asked Nov 22 '17 by Nurdin Premji



2 Answers

There is a workaround: rather than reading the JSON directly from the file, read it into an RDD of strings first and then apply the schema when parsing that RDD. Below is the code:

val expectedSchema = StructType(
  List(StructField("a", IntegerType, nullable = false),
       StructField("b", IntegerType, nullable = true))
)

test("testJSON") {
  val jsonRdd = spark.sparkContext.textFile("src/test/resources/test.json")
  // val readJson = spark.read.schema(expectedSchema).json("src/test/resources/test.json")
  val readJson = spark.read.schema(expectedSchema).json(jsonRdd)
  readJson.printSchema()
  assert(readJson.schema == expectedSchema)
}

The test case passes and the printSchema output is:

root
 |-- a: integer (nullable = false)
 |-- b: integer (nullable = true)

There is an Apache Spark JIRA for this issue, https://issues.apache.org/jira/browse/SPARK-10848, which was closed as not a problem with the comment:

This should be resolved in the latest file format refactoring in Spark 2.0. Please reopen it if you still hit the problem. Thanks!

If you are still getting the error you can reopen the JIRA. I tested on Spark 2.1.0 and still see the same issue.
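
Not part of the original answer, but a minimal sketch of another possible workaround if you would rather keep the file-based read: rebuild the DataFrame from the rows of the schema-based read using SparkSession.createDataFrame(RDD[Row], StructType), which keeps whatever nullability flags you pass in. Note that Spark does not verify the data against nullable = false, so this assumes the non-nullable columns really contain no nulls.

import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val expectedSchema = StructType(
  List(StructField("a", IntegerType, nullable = false),
       StructField("b", IntegerType, nullable = true))
)

// File-based read: values are parsed with the right types, but the JSON
// source flips every field back to nullable = true.
val fileRead = spark.read.schema(expectedSchema).json("src/test/resources/test.json")

// Re-wrap the same rows under the schema we actually want; the resulting
// DataFrame reports nullable = false for "a".
val readJson = spark.createDataFrame(fileRead.rdd, expectedSchema)

readJson.printSchema()
assert(readJson.schema == expectedSchema)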

answered Oct 12 '22 by Amit Kumar


The workaround above ensures the schema is correct, but null values are replaced with default values. In my case, when an Int does not exist in the JSON string it is set to 0.
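
To make that concrete, here is a small sketch (not from the answer) that reproduces the scenario: a hypothetical record with the non-nullable field "a" missing, read through the same RDD-based path as the accepted answer. Whether the missing value shows up as 0 depends on your Spark version, so treat the 0 as reported behaviour rather than a guarantee.

import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical record missing the non-nullable field "a" (not part of the
// question's test.json), parallelized so the example is self-contained.
val jsonRdd = spark.sparkContext.parallelize(Seq("""{"b": 2}"""))

val schema = StructType(
  List(StructField("a", IntegerType, nullable = false),
       StructField("b", IntegerType, nullable = true))
)

val df = spark.read.schema(schema).json(jsonRdd)
df.show()
// Per the observation above, "a" may come back as 0 here instead of null,
// because the column is declared non-nullable but the value is absent.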

answered Oct 12 '22 by EY.Mohamed