How to convert a case-class-based RDD into a DataFrame?

The Spark documentation shows how to create a DataFrame from an RDD, using Scala case classes to infer a schema. I am trying to reproduce this concept using sqlContext.createDataFrame(RDD, CaseClass), but my DataFrame ends up empty. Here's my Scala code:

// sc is the SparkContext, while sqlContext is the SQLContext.

// Define the case class and raw data
case class Dog(name: String)
val data = Array(
    Dog("Rex"),
    Dog("Fido")
)

// Create an RDD from the raw data
val dogRDD = sc.parallelize(data)

// Print the RDD for debugging (this works, shows 2 dogs)
dogRDD.collect().foreach(println)

// Create a DataFrame from the RDD
val dogDF = sqlContext.createDataFrame(dogRDD, classOf[Dog])

// Print the DataFrame for debugging (this fails, shows 0 dogs)
dogDF.show()

The output I'm seeing is:

Dog(Rex)
Dog(Fido)
++
||
++
||
||
++

What am I missing?

Thanks!

asked May 03 '16 by sparkour

2 Answers

All you need is:

val dogDF = sqlContext.createDataFrame(dogRDD)

The two-argument overload is part of the Java API and expects your class to follow the JavaBeans convention (a no-arg constructor plus getters/setters). Your case class doesn't follow this convention, so no properties are detected, which leads to an empty DataFrame with no columns.
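For comparison, the two-argument overload does work when the class actually follows the bean convention. A minimal sketch (the class name `DogBean` is illustrative, not from the original code):

```scala
// A JavaBeans-style class works with createDataFrame(rdd, classOf[...]),
// because the Java API discovers columns through getters/setters.
class DogBean(private var name: String) extends Serializable {
  def this() = this(null)              // no-arg constructor, required by the bean convention
  def getName: String = name           // getter exposes the "name" column
  def setName(n: String): Unit = { name = n }
}

val beanRDD = sc.parallelize(Seq(new DogBean("Rex"), new DogBean("Fido")))
val beanDF = sqlContext.createDataFrame(beanRDD, classOf[DogBean])
beanDF.show()  // should now show a "name" column with two rows
```

For a plain case class, though, the single-argument `createDataFrame(dogRDD)` above is the simpler fix, since Spark infers the schema via Scala reflection.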

answered Sep 30 '22 by Vitalii Kotliarenko


You can create a DataFrame directly from a Seq of case class instances using toDF as follows:

val dogDf = Seq(Dog("Rex"), Dog("Fido")).toDF
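Note that `toDF` comes from the SQLContext's implicit conversions, so outside of spark-shell (where they are typically pre-imported) you need to import them first. A minimal sketch:

```scala
// toDF is provided by the implicits on the SQLContext, so import them
// before converting a Seq of case class instances.
import sqlContext.implicits._

case class Dog(name: String)

val dogDf = Seq(Dog("Rex"), Dog("Fido")).toDF
dogDf.show()  // should show a "name" column with "Rex" and "Fido"
```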
answered Sep 30 '22 by David Griffin