Tuple to data frame in spark scala

I have an array called arraylist, which looks like this:

arraylist: Array[(String, Any)] = Array((id,772914), (x4,2), (x5,24), (x6,1), (x7,77491.25), (x8,17911.77778), (x9,225711), (x10,17), (x12,6), (x14,5), (x16,5), (x18,5.0), (x19,8.0), (x20,7959.0), (x21,676.0), (x22,228.5068871), (x23,195.0), (x24,109.6015511), (x25,965.0), (x26,1017.79043), (x27,2.0), (Target,1), (x29,13), (x30,735255.5), (x31,332998.432), (x32,38168.75), (x33,107957.5278), (x34,13), (x35,13), (x36,13), (x37,13), (x38,13), (x39,13), (x40,13), (x41,7), (x42,13), (x43,13), (x44,13), (x45,13), (x46,13), (x47,13), (x48,13), (x49,14.0), (x50,2.588435821), (x51,617127.5), (x52,414663.9738), (x53,39900.0), (x54,16743.15781), (x55,105000.0), (x56,52842.29076), (x57,25750.46154), (x58,8532.045819), (x64,13), (x66,13), (x67,13), (x68,13), (x69,13), (x70,13), (x71,13), (x73,13), (...

I want to convert it to a DataFrame with two columns, "ID" and value. The code I am using for this is:

val df = sc.parallelize(arraylist).toDF("Names","Values")

However, I am getting an error:

java.lang.UnsupportedOperationException: Schema for type Any is not supported

How can I overcome this problem?

Rajarshi Bhadra asked Dec 08 '16 15:12





1 Answer

The message tells you everything :) Any is not supported as a DataFrame column type. Scala can infer Any when, for example, the second element of some tuple is null.

Change the type of arraylist to Array[(String, Int)] (if you can do it manually; if the type is inferred by Scala, check for nulls and invalid values in the second element), or create the schema manually:

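import org.apache.spark.sql._
import org.apache.spark.sql.types._

// Sample data; the values are typed as Any, as in the question
val arraylist: Array[(String, Any)] = Array(("id", 772914), ("x4", 2.0), ("x5", 24.0))

// Explicit schema: a string column and a double column, both non-nullable
val schema = StructType(
    StructField("Names", StringType, false) ::
    StructField("Values", DoubleType, false) :: Nil)

// Cast each value to Number and convert it to Double so it matches DoubleType
val rdd = sc.parallelize(arraylist).map(x => Row(x._1, x._2.asInstanceOf[Number].doubleValue()))

val df = sqlContext.createDataFrame(rdd, schema)

df.show()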

Note: createDataFrame with an explicit schema requires an RDD[Row], so I'm converting the RDD of tuples to an RDD of Rows.
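
For the first option (fixing the element type and checking for nulls and invalid values), a minimal sketch in plain Scala might look like the following. It is not part of the original answer: the object name TupleCleanup, its methods, and the null and "n/a" entries in the sample data are made up for illustration, and Double is used rather than Int because the data in the question contains fractional values.

```scala
object TupleCleanup {
  // Keep only pairs whose value is a boxed number and convert it to Double.
  // A type pattern never matches null, so null values are dropped as well.
  def clean(pairs: Array[(String, Any)]): Array[(String, Double)] =
    pairs.collect { case (name, n: java.lang.Number) => (name, n.doubleValue()) }

  // Names of the pairs that clean() would drop (null or non-numeric values).
  def rejectedNames(pairs: Array[(String, Any)]): Array[String] =
    pairs.collect { case (name, v) if !v.isInstanceOf[java.lang.Number] => name }

  // Shortened sample; the last two entries are hypothetical bad values.
  val sample: Array[(String, Any)] =
    Array(("id", 772914), ("x4", 2), ("x7", 77491.25), ("x8", null), ("x9", "n/a"))

  val typed: Array[(String, Double)] = clean(sample)
  val rejected: Array[String] = rejectedNames(sample)
}
```

With the array typed as Array[(String, Double)], the original sc.parallelize(...).toDF("Names","Values") call should no longer hit the Any schema error, assuming the usual Spark implicits are in scope. Inspecting rejected first shows which entries caused Any to be inferred.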

T. Gawęda answered Sep 26 '22 22:09
