I am reading the schema of a DataFrame from a text file. The file looks like:
id,1,bigint
price,2,bigint
sqft,3,bigint
zip_id,4,int
name,5,string
and I am mapping the parsed type names to Spark SQL data types. The code for creating the DataFrame is:
import scala.io.Source
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schemaSt = new ListBuffer[(String, String)]()
// read schema from file
for (line <- Source.fromFile("meta.txt").getLines()) {
val word = line.split(",")
  schemaSt += ((word(0), word(2)))
}
// map datatypes
val types = Map("int" -> IntegerType, "bigint" -> LongType)
.withDefault(_ => StringType)
val schemaChanged = schemaSt.map(x => (x._1, types(x._2)))
// read data source
val lines = spark.sparkContext.textFile("data source path")
val fields = schemaChanged.map(x => StructField(x._1, x._2, nullable = true)).toList
val schema = StructType(fields)
val rowRDD = lines
.map(_.split("\t"))
.map(attributes => Row.fromSeq(attributes))
// Apply the schema to the RDD
val new_df = spark.createDataFrame(rowRDD, schema)
new_df.show(5)
new_df.printSchema()
The above works only for StringType. For IntegerType and LongType, it throws exceptions:
java.lang.RuntimeException: java.lang.String is not a valid external type for schema of int
and
java.lang.RuntimeException: java.lang.String is not a valid external type for schema of bigint.
Thanks in advance!
I had the same problem, and its cause is the Row.fromSeq() call.

If it is called on an array of String, the resulting Row is a row of Strings, which does not match the type of the second column in your schema (bigint or int).

To get a valid DataFrame as a result of Row.fromSeq(values: Seq[Any]), the elements of the values argument have to be of the types that correspond to your schema.
You are trying to store strings in numerically typed columns.
You need to cast string encoded numerical data to the appropriate numerical types while parsing.
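A minimal sketch of that cast, assuming the type names are the strings parsed from meta.txt ("int", "bigint", anything else treated as string) — castField and the sample values below are illustrative, not from the original post:

```scala
// Hypothetical per-type caster: maps a type name from meta.txt
// to a cast of the raw string into the matching JVM type.
def castField(raw: String, typeName: String): Any = typeName match {
  case "int"    => raw.trim.toInt   // IntegerType expects java.lang.Integer
  case "bigint" => raw.trim.toLong  // LongType expects java.lang.Long
  case _        => raw              // StringType: keep the string as-is
}

// One parsed line of the tab-separated data, zipped with the type names
// read from meta.txt, yields a Seq[Any] with schema-compatible types.
val typeNames = Seq("bigint", "bigint", "bigint", "int", "string")
val parts     = "1\t250000\t1800\t7\tvilla".split("\t").toSeq
val typedRow  = parts.zip(typeNames).map { case (raw, t) => castField(raw, t) }
// typedRow now holds Long, Long, Long, Int, String values
```

Wired into the question's pipeline, the row-building step would then look roughly like `.map(parts => Row.fromSeq(parts.zip(schemaSt.map(_._2)).map { case (raw, t) => castField(raw, t) }))`, so createDataFrame receives values whose types match the StructType.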