Reading csv files with missing columns and random column order

I have a schema that I want to apply to CSV files in Databricks. The CSV files may contain up to 6 columns (a, b, c, d, e, f), which can appear in any order. It can also happen that one or more columns are missing. So CSV files with any of these headers would be valid:

a,b,c,d,e,f
f,e,d,c,a,b
a,b,c
d,e,f

I can create a custom schema, but it handles neither the different column order nor the missing columns, since the schema fields are applied sequentially. Any ideas on how this can be dealt with?

from pyspark.sql.types import StructType, DoubleType

customSchema = StructType() \
  .add("a", DoubleType(), True) \
  .add("b", DoubleType(), True) \
  .add("c", DoubleType(), True) \
  .add("d", DoubleType(), True) \
  .add("e", DoubleType(), True) \
  .add("f", DoubleType(), False)

 
data = sqlContext.read.format("csv") \
  .option("header", "true") \
  .option("delimiter", ",") \
  .schema(customSchema) \
  .load("*.csv")
asked Jul 04 '18 by reachify

1 Answer

You could read the CSV file without specifying a schema, and then shape the DataFrame the way you like. In Scala, this would go as follows:

import org.apache.spark.sql.functions.{col, lit}

val df = spark.read.format("csv")
    .option("header", "true")
    .load("x.csv")

val cols = Seq("a", "b", "c", "d", "e", "f")

/* Here I select and cast the column if it exists. 
   I create a null column otherwise */
val shaped_df = df.select(cols.map(c =>
    if (df.columns.contains(c))
        col(c).cast("double")
    else
        lit(null).cast("double").alias(c)
): _*)

shaped_df.printSchema()
root
    |-- a: double (nullable = true)
    |-- b: double (nullable = true)
    |-- c: double (nullable = true)
    |-- d: double (nullable = true)
    |-- e: double (nullable = true)
    |-- f: double (nullable = true)
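Since the question uses PySpark, here is a minimal sketch of the same approach in Python (assuming a Spark 2.x+ session available as spark; untested):

from pyspark.sql.functions import col, lit

df = spark.read.format("csv") \
    .option("header", "true") \
    .load("*.csv")

cols = ["a", "b", "c", "d", "e", "f"]

# Select and cast each expected column if it exists;
# otherwise create a null column of the right type.
shaped_df = df.select([
    col(c).cast("double") if c in df.columns
    else lit(None).cast("double").alias(c)
    for c in cols
])

Note that columns created this way are nullable; if f must be non-nullable as in your original schema, you would still need to enforce that separately.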
answered Nov 15 '22 by Oli