I encountered a problem while trying to use Spark to simply read a CSV file. After that operation I would like to ensure that the data types are correct (using a provided schema) and that the headers are correct with respect to that schema.
This is the code I use and have problems with:
val schema = Encoders.product[T].schema
val df = spark.read
.schema(schema)
.option("header", "true")
.csv(fileName)
The type T is a Product, i.e. a case class. This works, but it doesn't check whether the column names are correct. So I can pass another file, and as long as the data types are correct no error occurs, and I remain unaware that the user provided the wrong file which, by coincidence, has the correct data types in the proper order.
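For example (a minimal sketch; Person and the file names are made-up for illustration, they are not from my actual code):

case class Person(name: String, age: Int)

import org.apache.spark.sql.{Encoders, SparkSession}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val schema = Encoders.product[Person].schema

// Suppose "wrong.csv" has the header "city,population" but compatible types.
// Spark reads it with the Person schema and never flags the header mismatch.
val df = spark.read
  .schema(schema)
  .option("header", "true")
  .csv("wrong.csv")

df.show() // rows load silently as if they were Person records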
I tried using the option that infers the schema and then calling the .as[T] method on the Dataset, but when any column other than a String column contains only nulls, Spark infers it as a String column, whereas in my schema it is an Integer. A cast exception then occurs, even though the column names were checked correctly.
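For reference, this is roughly that variant (a sketch, reusing the hypothetical Person case class and file name from above):

import spark.implicits._ // needed for the Encoder[Person]

val ds = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("people.csv")
  .as[Person] // validates column names, but fails with a cast/up-cast error
              // when an all-null column was inferred as StringType instead of IntegerType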
To summarize: I found a solution that ensures correct data types but does not validate the headers, and another solution that validates the headers but has problems with the data types. Is there any way to achieve both, i.e. complete validation of headers and types?
I am using Spark 2.2.0.
Spark has built-in support for reading CSV files. We can use spark.read, which will read the CSV data and return a DataFrame: call the CSV reader and pass it the path to the CSV file. There are other generic ways to read CSV files as well; you can use either method.
Although Spark identifies column data types correctly in most cases, for production workloads it is recommended to pass a custom schema while reading the file, so that the data type of each column is known up front. We can do that using Spark's StructType and StructField.
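For example (a small sketch; the file and column names are hypothetical):

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Explicit schema: skip inference and enforce these column types.
val customSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

val peopleDf = spark.read
  .schema(customSchema)
  .option("header", "true")
  .csv("people.csv")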
PySpark provides csv("path") on DataFrameReader to read a CSV file into a PySpark DataFrame, and dataframeObj.write.csv("path") to save or write a DataFrame to a CSV file.
However, in big data technologies like HDFS, Data Lake, etc., you can load the file without a schema and read it directly into a compute engine like Spark for processing. For instance, in Azure Databricks (Spark), a file such as InjuryRecord_withoutdate.csv can be loaded into the Databricks File System and read from there.
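A rough Scala counterpart of those read and write calls might look like this (the DBFS paths are hypothetical):

// Read a CSV from DBFS into a DataFrame, then write it back out as CSV.
val injuries = spark.read
  .option("header", "true")
  .csv("/FileStore/tables/InjuryRecord_withoutdate.csv")

injuries.write
  .option("header", "true")
  .mode("overwrite")
  .csv("/FileStore/tables/injury_output")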
Looks like you'll have to do it yourself by reading the file header twice.
Looking at Spark's code, the inferred header is completely ignored (never actually read) if a user supplies their own schema, so there's no way of making Spark fail on such an inconsistency.
To perform this comparison yourself:
val schema = Encoders.product[T].schema
// read the actual schema; This shouldn't be too expensive as Spark's
// laziness would avoid actually reading the entire file
val fileSchema = spark.read
.option("header", "true")
.csv("test.csv").schema
// read the file using your own schema. You can later use this DF
val df = spark.read.schema(schema)
.option("header", "true")
.csv("test.csv")
// compare actual and expected column names:
val badColumnNames = fileSchema.fields.map(_.name)
.zip(schema.fields.map(_.name))
.filter { case (actual, expected) => actual != expected }
// fail if any inconsistency found:
assert(badColumnNames.isEmpty,
s"file schema does not match expected; Bad column names: ${badColumnNames.mkString("; ")}")