I'm having trouble encoding data when some columns of type Option[Seq[String]] are missing from our data source. Ideally, I would like the missing column data to be filled with None.
Scenario:
We have some parquet files that we are reading in that have column1 but not column2.
We load the data from these parquet files into a Dataset and cast it as MyType.
case class MyType(column1: Option[String], column2: Option[Seq[String]])
sqlContext.read.parquet("dataSource.parquet").as[MyType]
org.apache.spark.sql.AnalysisException: cannot resolve 'column2' given input columns: [column1];
Is there a way to create the Dataset with column2 data as None?
In simple cases you can provide an initial schema which is a superset of expected schemas. For example in your case:
val schema = Seq[MyType]().toDF.schema
Seq("a", "b", "c").map(Option(_))
  .toDF("column1")
  .write.parquet("/tmp/column1only")
val df = spark.read.schema(schema).parquet("/tmp/column1only").as[MyType]
df.show
+-------+-------+
|column1|column2|
+-------+-------+
|      a|   null|
|      b|   null|
|      c|   null|
+-------+-------+
df.first
MyType = MyType(Some(a),None)
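The superset schema can also be taken from the case class encoder instead of an empty Seq. The following is a minimal sketch, assuming a local SparkSession and the /tmp/column1only files written in the example; the object name SchemaFromEncoder is made up for illustration.

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

case class MyType(column1: Option[String], column2: Option[Seq[String]])

// Hypothetical wrapper object, only here to make the snippet runnable.
object SchemaFromEncoder {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this demonstration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("schema-from-encoder")
      .getOrCreate()
    import spark.implicits._

    // Schema derived from the case class: column1 and column2.
    val schema = Encoders.product[MyType].schema

    // Columns that are in the schema but absent from the files come back as null.
    val ds = spark.read.schema(schema).parquet("/tmp/column1only").as[MyType]
    ds.show()

    spark.stop()
  }
}
```

The resulting schema should match the one produced by Seq[MyType]().toDF.schema, so the rest of the example stays the same.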
This approach can be a little fragile, so in general it is better to use SQL literals to fill the blanks:
import org.apache.spark.sql.functions.lit
spark.read.parquet("/tmp/column1only")
  // typed null; cast also accepts a DataType such as ArrayType(StringType)
  .withColumn("column2", lit(null).cast("array<string>"))
  .as[MyType]
  .first
MyType = MyType(Some(a),None)
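When more than one column may be missing, the literal approach can be generalized. The sketch below is an assumption-laden illustration rather than an established API: the helper withMissingAsNull and the object names are hypothetical, it only handles top-level columns, and it compares column names case-sensitively.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Encoder, SparkSession}
import org.apache.spark.sql.functions.lit

case class MyType(column1: Option[String], column2: Option[Seq[String]])

object FillMissingColumns {
  // Hypothetical helper: for every field in the target schema that the
  // DataFrame lacks, add a null literal cast to the field's data type.
  def withMissingAsNull[T: Encoder](df: DataFrame): Dataset[T] = {
    val target = implicitly[Encoder[T]].schema
    val filled = target.fields.foldLeft(df) { (acc, field) =>
      if (acc.columns.contains(field.name)) acc
      else acc.withColumn(field.name, lit(null).cast(field.dataType))
    }
    filled.as[T]
  }
}

object FillMissingColumnsDemo {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session; the input mimics files with only column1.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("fill-missing-columns")
      .getOrCreate()
    import spark.implicits._

    val df = Seq("a", "b", "c").toDF("column1")
    val ds = FillMissingColumns.withMissingAsNull[MyType](df)
    ds.show()

    spark.stop()
  }
}
```

Deriving the missing columns from the encoder keeps the null literals typed consistently with the case class, so a new Option field does not need a hand-written withColumn call.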