How to create a Schema file in Spark

I am trying to read a schema file (a plain text file) and apply it to a CSV file that has no header. Since I already have the schema, I don't want to use the inferSchema option, which adds an extra pass over the data.

My schema file looks like this:

"num IntegerType","letter StringType"

I tried the code below to build a schema from it:

val schema_file = spark.read.textFile("D:\\Users\\Documents\\schemaFile.txt")
val struct_type = schema_file.flatMap(x => x.split(",")).map(b => (b.split(" ")(0).stripPrefix("\"").asInstanceOf[String],b.split(" ")(1).stripSuffix("\"").asInstanceOf[org.apache.spark.sql.types.DataType])).foreach(x=>println(x))

It fails with the following error:

Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for org.apache.spark.sql.types.DataType

- field (class: "org.apache.spark.sql.types.DataType", name: "_2") - root class: "scala.Tuple2"

I then want to use the result as the schema in spark.read.csv, as below, and write the data out as an ORC file:

  val df=spark.read
      .option("header", false)
      .option("inferSchema", true)
      .option("nullValue", "NULL")

I need help turning the text file into a schema and converting my input CSV file to ORC.

Gladiator asked May 24 '18 04:05


2 Answers

To create a schema from a text file, first write a function that maps each type name to the corresponding DataType:

import org.apache.spark.sql.types._

def getType(raw: String): DataType = {
  raw match {
    case "ByteType" => ByteType
    case "ShortType" => ShortType
    case "IntegerType" => IntegerType
    case "LongType" => LongType
    case "FloatType" => FloatType
    case "DoubleType" => DoubleType
    case "BooleanType" => BooleanType
    case "TimestampType" => TimestampType
    case _ => StringType // fall back to string for any unrecognised name
  }
}

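Now build the list of fields by reading the schema file on the driver with plain Scala I/O (no Dataset, so no Encoder is needed):

import scala.io.Source
import org.apache.spark.sql.types.StructField

val schema = Source.fromFile("schema.txt").getLines().toList
  .flatMap(_.split(",")).map(_.replaceAll("\"", "").split(" "))
  .map(x => StructField(x(0), getType(x(1)), true))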

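Now wrap the fields in a StructType (from org.apache.spark.sql.types) and pass it to the reader. The CSV path is a placeholder, since the original snippet does not show it:

val df = spark.read
  .option("samplingRatio", "0.01")
  .option("delimiter", "|")
  .option("nullValue", "NULL")
  .schema(StructType(schema))
  .csv("<path to csv>") // placeholder path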

Hope this helps!

koiralo answered Oct 18 '22 00:10


You can create a JSON file named schema.json in the following format:

{
  "fields": [
    {
      "metadata": {},
      "name": "first_fields",
      "nullable": true,
      "type": "string"
    },
    {
      "metadata": {},
      "name": "double_field",
      "nullable": true,
      "type": "double"
    }
  ],
  "type": "struct"
}

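If the schema only exists in the question's text format ("num IntegerType","letter StringType"), a dictionary in this JSON layout can be generated from it. The following is a minimal sketch, not part of the original answer: the function name text_schema_to_json is hypothetical, and it assumes the JSON type name is the Spark class name with the "Type" suffix dropped and lowercased (IntegerType becomes integer), which holds for the simple types discussed here but not for parameterised ones such as decimals.

```python
import json


def text_schema_to_json(line):
    """Turn '"num IntegerType","letter StringType"' into a Spark-style schema dict."""
    fields = []
    for entry in line.strip().split(","):
        # Each entry looks like "name SomeType", wrapped in double quotes.
        name, type_name = entry.strip().strip('"').split()
        # Assumption: JSON type name = class name without "Type", lowercased.
        if type_name.endswith("Type"):
            type_name = type_name[: -len("Type")]
        fields.append(
            {
                "metadata": {},
                "name": name,
                "nullable": True,
                "type": type_name.lower(),
            }
        )
    return {"fields": fields, "type": "struct"}


schema_dict = text_schema_to_json('"num IntegerType","letter StringType"')
print(json.dumps(schema_dict, indent=2))
```

The resulting dictionary can be saved as schema.json with json.dump, or handed straight to StructType.fromJson without going through a file.
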
Build a StructType by reading schema.json (PySpark):

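import json
from pyspark.sql.types import StructType

# wholeTextFiles yields (path, content) pairs; take the content of the single file
text = spark.sparkContext.wholeTextFiles("s3://<bucket>/schema.json").collect()[0][1]
schema_dict = json.loads(text)  # renamed from dict to avoid shadowing the built-in
custom_schema = StructType.fromJson(schema_dict)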

After that, pass custom_schema as the schema when reading the CSV file:

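df = (spark.read
      .option("header", False)
      .option("nullValue", "NULL")
      .schema(custom_schema)  # explicit schema, so inferSchema is not needed
      .csv("s3://<bucket>/<input>.csv"))  # placeholder path, not in the original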
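
The question also asks for ORC output, which the snippets here do not reach. A minimal PySpark sketch of the full read-then-write step might look like the following. The function name csv_to_orc and its parameters are hypothetical; the reader and writer calls (option, schema, csv, mode, orc) are the standard DataFrameReader and DataFrameWriter methods.

```python
def csv_to_orc(spark, schema, csv_path, orc_path):
    """Read a headerless CSV with an explicit schema and write it out as ORC."""
    df = (
        spark.read
        .option("header", False)      # the input file has no header row
        .option("nullValue", "NULL")  # treat the literal string NULL as null
        .schema(schema)               # explicit schema, so no inference pass
        .csv(csv_path)
    )
    # mode("overwrite") replaces existing output; omit it to fail if the path exists
    df.write.mode("overwrite").orc(orc_path)
    return df
```

With the objects from this answer it would be called as csv_to_orc(spark, custom_schema, <csv path>, <orc output directory>), where both paths are yours to fill in.
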
ankursingh1000 answered Oct 18 '22 01:10
