
Read parquet into spark dataset ignoring missing fields [duplicate]

Let's assume I create a Parquet file as follows:

// Assumes a spark-shell session, where spark.implicits._ is already in scope
case class A(i: Int, j: Double, s: String)

val l1 = List(A(1, 2.0, "s1"), A(2, 3.0, "S2"))

val ds = spark.createDataset(l1)
ds.write.parquet("/tmp/test.parquet")

Is it possible to read it into a Dataset of a type with a different schema, where the only difference is a few additional fields?

E.g.:

case class B(i: Int, j: Double, s: String, d: Double = 1.0) // d is extra and has a default value

Is there a way I can make this work?

val ds2 = spark.read.parquet("/tmp/test.parquet").as[B]
asked Apr 23 '17 by indraneel

1 Answer

In Spark, if the schema of a Dataset does not match the desired type U, you can use select along with alias or as to rearrange or rename columns as required. This means that for the following code to work:

val ds2 = spark.read.parquet("/tmp/test.parquet").as[B]

The following modification needs to be made:

import org.apache.spark.sql.functions.lit

val ds2 = spark.read.parquet("/tmp/test.parquet").withColumn("d", lit(1.0)).as[B]
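
For completeness, here is a minimal sketch of the select/alias route that the quoted documentation refers to, assuming the same /tmp/test.parquet file from the question (the name ds3 is just illustrative):

import org.apache.spark.sql.functions.{col, lit}

// Select the existing columns and synthesize the missing one explicitly
val ds3 = spark.read.parquet("/tmp/test.parquet")
  .select(col("i"), col("j"), col("s"), lit(1.0).as("d"))
  .as[B]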

Or, if creating an additional column is not possible, the following can be done:

val ds2 = spark.read.parquet("/tmp/test.parquet").map { row =>
  // d is not supplied here, so it falls back to its default value of 1.0
  B(row.getInt(0), row.getDouble(1), row.getString(2))
}
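
Note where d's value comes from in each case: the withColumn and select variants set it explicitly via lit, while the map variant relies on the default parameter declared in B. Either way, a quick check should confirm that d is populated (a sketch; row order in the output may vary):

ds2.show()
// +---+---+---+---+
// |  i|  j|  s|  d|
// +---+---+---+---+
// |  1|2.0| s1|1.0|
// |  2|3.0| S2|1.0|
// +---+---+---+---+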
answered Sep 29 '22 by himanshuIIITian