Spark - creating schema programmatically with different data types

I have a dataset with 7-8 fields of type String, Int and Float.

I am trying to create the schema programmatically, like this:

val schema = StructType(header.split(",").map(column => StructField(column, StringType, true)))

and then mapping each line to a Row, like this:

val dataRdd = datafile.filter(x => x!=header).map(x => x.split(",")).map(col => Row(col(0).trim, col(1).toInt, col(2).toFloat, col(3), col(4) ,col(5), col(6), col(7), col(8)))

But after creating the DataFrame, calling DF.show() gives an error for the Integer field.

So how do I create a schema when the dataset contains multiple data types?

AJm asked Nov 26 '25 03:11

1 Answer

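The problem in your code is that the schema declares every field as StringType, while the Row holds Int and Float values for some of them. The mismatch only surfaces when the DataFrame is evaluated, for example by show().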
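
One way to make the schema agree with the Row values is to write the field types out by hand. What follows is a minimal sketch rather than the asker's actual data: the column names (name, count, score), the sample lines, and the local SparkSession are all assumptions made for illustration.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

// Assumption: a local session, only for trying the example out.
val spark = SparkSession.builder().master("local[1]").appName("explicit-schema").getOrCreate()

// Hypothetical three-column sample standing in for the real file.
val lines = Seq("name,count,score", "a,1,1.5", "b,2,2.5")
val header = lines.head
val datafile = spark.sparkContext.parallelize(lines)

// Each declared type matches the value placed at that position in the Row:
// String -> StringType, Int -> IntegerType, Float -> FloatType.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("count", IntegerType, nullable = true),
  StructField("score", FloatType, nullable = true)
))

// Trim before converting so stray spaces around a number do not break toInt/toFloat.
val dataRdd = datafile
  .filter(_ != header)
  .map(_.split(","))
  .map(col => Row(col(0).trim, col(1).trim.toInt, col(2).trim.toFloat))

val df = spark.createDataFrame(dataRdd, schema)
df.show()
```

The point of the sketch is only that the type declared at each position in the schema is the type of the value stored at that position in the Row.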

If the header contains only the names of the fields, the types can't be guessed from it.

Let's assume instead that the header string also carries the types, like this:

val header = "field1:Int,field2:Double,field3:String"

Then the code should be

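import org.apache.spark.sql.types._

// Map the tag after ':' in a "name:Type" header entry to a Spark SQL type.
def inferType(field: String): DataType = field.split(":")(1) match {
   case "Int" => IntegerType
   case "Double" => DoubleType
   case "String" => StringType
   case _ => StringType  // fall back to String for any unrecognised tag
}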

val schema = StructType(header.split(",").map(column => StructField(column, inferType(column), true)))

For the example header string, the resulting schema prints as:

root
 |-- field1:Int: integer (nullable = true)
 |-- field2:Double: double (nullable = true)
 |-- field3:String: string (nullable = true)
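
The schema printed above keeps the whole field1:Int entry as the column name. If only the part before the colon is wanted as the name, and the raw values need converting to match the declared types, something along these lines might work. It is a sketch under assumptions: toField, convert and toRow are hypothetical helper names, and every header entry is assumed to have exactly the name:Type shape.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Same "name:Type" header format as in the answer above.
val header = "field1:Int,field2:Double,field3:String"

// Hypothetical helper: turn one header entry into a StructField, keeping only
// the part before ':' as the column name. Throws a MatchError if an entry
// does not have exactly one ':'.
def toField(entry: String): StructField = {
  val Array(name, tag) = entry.split(":").map(_.trim)
  val dataType = tag match {
    case "Int"    => IntegerType
    case "Double" => DoubleType
    case _        => StringType
  }
  StructField(name, dataType, nullable = true)
}

val schema = StructType(header.split(",").map(toField))

// Hypothetical helper: convert one raw string to the JVM type that the
// schema declares for its column.
def convert(raw: String, dataType: DataType): Any = dataType match {
  case IntegerType => raw.toInt
  case DoubleType  => raw.toDouble
  case _           => raw
}

// Hypothetical helper: parse one CSV line into a Row that matches the schema.
def toRow(line: String): Row = {
  val values = line.split(",").map(_.trim).zip(schema.fields).map {
    case (raw, field) => convert(raw, field.dataType)
  }
  Row.fromSeq(values.toSeq)
}
```

With these helpers, mapping the data lines through toRow (for instance datafile.filter(x => x != header).map(toRow)) should give rows whose values line up with the schema passed to createDataFrame.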

On the other hand, if what you need is a DataFrame from a text file, I would suggest creating the DataFrame directly from the file itself. It's pointless to create it from an RDD.

val fileReader = spark.read.format("com.databricks.spark.csv")
  .option("mode", "DROPMALFORMED")
  .option("header", "true")
  .option("inferschema", "true")
  .option("delimiter", ",")

val df = fileReader.load(PATH_TO_FILE)
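
If the column types are known in advance, the reader can also be given an explicit schema instead of inferring one. The sketch below assumes Spark 2.0 or later, where a CSV source is built in; the temp file, its contents, and the column names are made up for the example.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// Assumption: a local session, only for trying the example out.
val spark = SparkSession.builder().master("local[1]").appName("csv-schema").getOrCreate()

// Hypothetical sample file written to a temp location.
val path = Files.createTempFile("sample", ".csv")
Files.write(path, "name,count,score\na,1,1.5\nb,2,2.5\n".getBytes("UTF-8"))

val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("count", IntegerType, nullable = true),
  StructField("score", FloatType, nullable = true)
))

val df = spark.read
  .option("header", "true")   // first line holds column names, not data
  .schema(schema)             // use the declared types instead of inferring them
  .csv(path.toString)

df.printSchema()
```

Compared with inferSchema, this keeps the types under your control and avoids the extra pass over the data that inference needs.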
elghoto answered Nov 28 '25 15:11