Reading csv files with missing columns and random column order

I have a schema that I want to apply to CSV files in Databricks. The CSV files may contain up to 6 columns (a, b, c, d, e, f), which can appear in any order. It can also happen that one or more columns are missing. So CSV files with any of these headers would be valid:

a,b,c,d,e,f
f,e,d,c,a,b
a,b,c
d,e,f

I can create a custom schema, but it handles neither the different column order nor the missing columns, since the schema fields are applied sequentially. Any ideas on how this can be dealt with?

from pyspark.sql.types import StructType, DoubleType

customSchema = StructType() \
  .add("a", DoubleType(), True) \
  .add("b", DoubleType(), True) \
  .add("c", DoubleType(), True) \
  .add("d", DoubleType(), True) \
  .add("e", DoubleType(), True) \
  .add("f", DoubleType(), False)

 
data = sqlContext.read.format("csv") \
  .option("header", "true") \
  .option("delimiter", ",") \
  .schema(customSchema) \
  .load("*.csv")
asked Jul 04 '18 by reachify

1 Answer

You could read the CSV file without specifying a schema, and then shape the DataFrame the way you like. In Scala, this would go as follows:

import org.apache.spark.sql.functions.{col, lit}

val df = spark.read.format("csv")
    .option("header", "true")
    .load("x.csv")

val cols = Seq("a", "b", "c", "d", "e", "f")

/* Here I select and cast the column if it exists. 
   I create a null column otherwise */
val shaped_df = df.select(cols.map(c =>
    if (df.columns.contains(c))
        col(c).cast("double")
    else
        lit(null).cast("double").alias(c)
): _*)

shaped_df.printSchema()
root
    |-- a: double (nullable = true)
    |-- b: double (nullable = true)
    |-- c: double (nullable = true)
    |-- d: double (nullable = true)
    |-- e: double (nullable = true)
    |-- f: double (nullable = true)
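Since the question uses PySpark, here is a minimal sketch of the same approach in Python (assuming a Spark 2.x+ session available as spark; untested):

from pyspark.sql.functions import col, lit

df = spark.read.format("csv") \
    .option("header", "true") \
    .load("*.csv")

cols = ["a", "b", "c", "d", "e", "f"]

# Select and cast each expected column if it exists;
# otherwise create a null column of the right type.
shaped_df = df.select([
    col(c).cast("double") if c in df.columns
    else lit(None).cast("double").alias(c)
    for c in cols
])

Note that columns created this way are nullable; if f must be non-nullable as in your original schema, you would still need to enforce that separately.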
answered Nov 15 '22 by Oli