I have an array called arraylist that looks like this:
arraylist: Array[(String, Any)] = Array((id,772914), (x4,2), (x5,24), (x6,1), (x7,77491.25), (x8,17911.77778), (x9,225711), (x10,17), (x12,6), (x14,5), (x16,5), (x18,5.0), (x19,8.0), (x20,7959.0), (x21,676.0), (x22,228.5068871), (x23,195.0), (x24,109.6015511), (x25,965.0), (x26,1017.79043), (x27,2.0), (Target,1), (x29,13), (x30,735255.5), (x31,332998.432), (x32,38168.75), (x33,107957.5278), (x34,13), (x35,13), (x36,13), (x37,13), (x38,13), (x39,13), (x40,13), (x41,7), (x42,13), (x43,13), (x44,13), (x45,13), (x46,13), (x47,13), (x48,13), (x49,14.0), (x50,2.588435821), (x51,617127.5), (x52,414663.9738), (x53,39900.0), (x54,16743.15781), (x55,105000.0), (x56,52842.29076), (x57,25750.46154), (x58,8532.045819), (x64,13), (x66,13), (x67,13), (x68,13), (x69,13), (x70,13), (x71,13), (x73,13), (...
I want to convert it to a DataFrame with two columns, "Names" and "Values". The code I am using for this is:
val df = sc.parallelize(arraylist).toDF("Names","Values")
However, I am getting an error:
java.lang.UnsupportedOperationException: Schema for type Any is not supported
How can I overcome this problem?
Convert using the createDataFrame method: the SparkSession object has a utility method for creating a DataFrame, createDataFrame. This method can take an RDD and create a DataFrame from it. createDataFrame is overloaded, so it can be called with the RDD alone or with the RDD and a schema.
Method 1: using the createDataFrame() function. After creating the RDD, convert it to a DataFrame with createDataFrame(), passing the RDD and a schema that defines the DataFrame's columns.
Convert an RDD to a DataFrame using createDataFrame() and chain it with toDF() to name the columns. Here, the Scala syntax : _* expands the array of column names into separate arguments.
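As a small illustration of the : _* syntax mentioned above, here is a plain Scala sketch that needs no Spark. The joinNames function is a hypothetical stand-in for any method that takes a variable number of String arguments, the way toDF takes column names.

```scala
object VarargsDemo {
  // Hypothetical stand-in for a varargs method such as toDF(colNames: String*).
  def joinNames(names: String*): String = names.mkString(", ")

  def main(args: Array[String]): Unit = {
    val columns = Array("Names", "Values")
    // `columns: _*` passes each array element as a separate argument,
    // so this call is equivalent to joinNames("Names", "Values").
    println(joinNames(columns: _*))
  }
}
```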
The message tells you everything :) Any is not supported as the type of a DataFrame column. The Any
type can be caused by nulls in the second element of a tuple.
Change the arraylist type to Array[(String, Int)] (or Array[(String, Double)], since several of your values are fractional)
if you can do it manually; if the type is inferred by Scala, check for nulls and invalid values in the second element. Alternatively, create the schema manually:
import org.apache.spark.sql.types._
import org.apache.spark.sql._
val arraylist: Array[(String, Any)] = Array(("id", 772914), ("x4", 2.0), ("x5", 24.0))
// Declare the column types explicitly instead of relying on inference
val schema = StructType(
  StructField("Names", StringType, false) ::
  StructField("Values", DoubleType, false) :: Nil)
// Convert each tuple to a Row, casting the numeric value to Double
val rdd = sc.parallelize(arraylist).map(x => Row(x._1, x._2.asInstanceOf[Number].doubleValue()))
val df = sqlContext.createDataFrame(rdd, schema)
df.show()
Note: createDataFrame requires an RDD[Row], so I'm converting the RDD of tuples to an RDD of Rows.
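If the array may contain nulls or non-numeric values, the cast to Number above would fail at runtime. One possible way to guard against that is to normalise the values to Double first. The sketch below is plain Scala, not Spark; the names NormalizeValues, toDoubleOption and normalize are my own, and dropping bad entries (rather than failing or substituting a default) is an assumption you may want to change.

```scala
import scala.util.Try

object NormalizeValues {
  // Returns Some(double) for numeric values and numeric strings,
  // None for nulls and anything that cannot be read as a number.
  def toDoubleOption(value: Any): Option[Double] = value match {
    case n: java.lang.Number => Some(n.doubleValue())
    case s: String           => Try(s.trim.toDouble).toOption
    case _                   => None
  }

  // Keeps only the pairs whose value could be converted (assumption: bad entries are dropped).
  def normalize(pairs: Array[(String, Any)]): Array[(String, Double)] =
    pairs.flatMap { case (name, value) => toDoubleOption(value).map(d => (name, d)) }

  def main(args: Array[String]): Unit = {
    val arraylist: Array[(String, Any)] =
      Array(("id", 772914), ("x4", 2), ("x7", 77491.25), ("bad", null))
    normalize(arraylist).foreach(println)
  }
}
```

The result is an Array[(String, Double)], so in spark-shell the original one-liner should then work on it: sc.parallelize(NormalizeValues.normalize(arraylist)).toDF("Names", "Values").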