I have a schema that I want to apply to CSV files in Databricks. The CSV files may contain up to 6 columns (a, b, c, d, e, f), which can appear in any order, and one or more of the columns may be missing. So CSV files with any of these headers would be valid:
a,b,c,d,e,f
f,e,d,c,a,b
a,b,c
d,e,f
I can create a custom schema, but it does not handle the varying column order or the missing columns, because the schema fields are applied sequentially by position. Any ideas on how this can be dealt with?
from pyspark.sql.types import StructType, DoubleType

customSchema = StructType() \
    .add("a", DoubleType(), True) \
    .add("b", DoubleType(), True) \
    .add("c", DoubleType(), True) \
    .add("d", DoubleType(), True) \
    .add("e", DoubleType(), True) \
    .add("f", DoubleType(), False)

data = sqlContext.read.format("csv") \
    .option("header", "true") \
    .option("delimiter", ",") \
    .schema(customSchema) \
    .load("*.csv")
You could read the CSV file without specifying a schema and then shape the DataFrame the way you like. In Scala, that would go as follows:
import org.apache.spark.sql.functions.{col, lit}

val df = spark.read.format("csv")
  .option("header", "true")
  .load("x.csv")

val cols = Seq("a", "b", "c", "d", "e", "f")

// Select and cast each expected column to double if it exists in the file;
// otherwise create a null column with the expected name.
val shaped_df = df.select(cols.map(c =>
  if (df.columns.contains(c))
    col(c).cast("double")
  else
    lit(null).cast("double").alias(c)
): _*)
shaped_df.printSchema()
root
|-- a: double (nullable = true)
|-- b: double (nullable = true)
|-- c: double (nullable = true)
|-- d: double (nullable = true)
|-- e: double (nullable = true)
|-- f: double (nullable = true)
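Since the question uses PySpark, here is a minimal sketch of the same idea in Python, assuming a Spark session named spark is available (the column list and the *.csv path are taken from the question; adjust them as needed):

from pyspark.sql.functions import col, lit

# Read the CSV with its header, letting Spark pick up whatever columns are present
df = spark.read.format("csv") \
    .option("header", "true") \
    .load("*.csv")

expected_cols = ["a", "b", "c", "d", "e", "f"]

# For each expected column: cast it to double if it is present in the file,
# otherwise add a null double column with that name.
shaped_df = df.select([
    col(c).cast("double").alias(c) if c in df.columns
    else lit(None).cast("double").alias(c)
    for c in expected_cols
])

shaped_df.printSchema()

As in the Scala version, every file ends up with all six columns in the same fixed order, regardless of how (or whether) they appear in the source CSV.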