I am new to Scala. I am trying to convert a Scala list (which holds the results of some calculated data on a source DataFrame) to a DataFrame or Dataset. I have not found any direct method for that, so I tried the following approaches to convert my list to a Dataset, but they do not seem to work. The three situations are given below.
Can someone please give me some ray of hope on how to do this conversion? Thanks.
import org.apache.spark.sql.{DataFrame, Row, SQLContext, DataFrameReader}
import java.sql.{Connection, DriverManager, ResultSet, Timestamp}
import scala.collection._
case class TestPerson(name: String, age: Long, salary: Double)
var tom = new TestPerson("Tom Hanks",37,35.5)
var sam = new TestPerson("Sam Smith",40,40.5)
val PersonList = mutable.MutableList[TestPerson]()
//Adding data in list
PersonList += tom
PersonList += sam
//Situation 1: Trying to create dataset from List of objects:- Result:Error
//Throwing error
var personDS = Seq(PersonList).toDS()
/*
ERROR:
error: Unable to find encoder for type stored in a Dataset. Primitive types
(Int, String, etc) and Product types (case classes) are supported by
importing sqlContext.implicits._ Support for serializing other types will
be added in future releases.
var personDS = Seq(PersonList).toDS()
*/
//Situation 2: Trying to add data one by one :- Result: not working as desired;
//the last record overwrites any existing data in the DS
var personDS = Seq(tom).toDS()
personDS = Seq(sam).toDS()
personDS += sam //not working. throwing error
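Note: a Dataset is immutable, so += cannot work. One way to append records is to build a new Dataset, e.g. with union. A rough sketch, assuming Spark 2.x, an active SparkSession named spark, and the tom/sam values from above:
import spark.implicits._

var personDS = Seq(tom).toDS()
// union returns a new Dataset containing the rows of both inputs
personDS = personDS.union(Seq(sam).toDS())
personDS.show()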
//Situation 3: Working. However, I have consolidated data in the list that I
//want to convert to a DS; if I loop over the list and pass its values here in
//comma-separated form it will work, but that creates an extra loop in the
//code, which I want to avoid.
var personDS = Seq(tom,sam).toDS()
scala> personDS.show()
+---------+---+------+
| name|age|salary|
+---------+---+------+
|Tom Hanks| 37| 35.5|
|Sam Smith| 40| 40.5|
+---------+---+------+
Convert using the createDataFrame method: the SparkSession object has a utility method for creating a DataFrame, createDataFrame. This method can take an RDD and create a DataFrame from it. createDataFrame is an overloaded method, and we can call it by passing the RDD alone or together with a schema.
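As a rough sketch of both overloads (assuming an active SparkSession named spark and the TestPerson case class defined above):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, LongType, StringType, StructField, StructType}

// Overload 1: an RDD of case-class (Product) instances; the schema is inferred from TestPerson.
val personRDD = spark.sparkContext.parallelize(Seq(
  TestPerson("Tom Hanks", 37, 35.5),
  TestPerson("Sam Smith", 40, 40.5)))
val dfInferred = spark.createDataFrame(personRDD)

// Overload 2: an RDD of Rows plus an explicit schema.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", LongType, nullable = false),
  StructField("salary", DoubleType, nullable = false)))
val rowRDD = personRDD.map(p => Row(p.name, p.age, p.salary))
val dfWithSchema = spark.createDataFrame(rowRDD, schema)
dfWithSchema.show()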
toDF(): the toDF() method provides a very concise way to create a DataFrame. It can be applied to a sequence of objects. To access toDF(), we have to import spark.implicits._.
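A small toDF() sketch along the same lines, again assuming spark.implicits._ has been imported from an active SparkSession; the tuple version shows that toDF(colNames: String*) can also supply the column names:
import spark.implicits._

// Column names come from the case class fields.
val df1 = Seq(TestPerson("Tom Hanks", 37, 35.5), TestPerson("Sam Smith", 40, 40.5)).toDF()

// With tuples, toDF("name", "age", "salary") supplies the column names explicitly.
val df2 = Seq(("Tom Hanks", 37L, 35.5), ("Sam Smith", 40L, 40.5)).toDF("name", "age", "salary")
df2.show()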
Try without Seq:
// Assumes an active SparkSession named `spark` (e.g. in spark-shell)
import scala.collection.mutable
import spark.implicits._

case class TestPerson(name: String, age: Long, salary: Double)
val tom = TestPerson("Tom Hanks",37,35.5)
val sam = TestPerson("Sam Smith",40,40.5)
val PersonList = mutable.MutableList[TestPerson]()
PersonList += tom
PersonList += sam
val personDS = PersonList.toDS()
println(personDS.getClass)
personDS.show()
val personDF = PersonList.toDF()
println(personDF.getClass)
personDF.show()
personDF.select("name", "age").show()
Output:
class org.apache.spark.sql.Dataset
+---------+---+------+
| name|age|salary|
+---------+---+------+
|Tom Hanks| 37| 35.5|
|Sam Smith| 40| 40.5|
+---------+---+------+
class org.apache.spark.sql.DataFrame
+---------+---+------+
| name|age|salary|
+---------+---+------+
|Tom Hanks| 37| 35.5|
|Sam Smith| 40| 40.5|
+---------+---+------+
+---------+---+
| name|age|
+---------+---+
|Tom Hanks| 37|
|Sam Smith| 40|
+---------+---+
Also, make sure to move the declaration of the case class TestPerson
outside the scope of your object.
import org.apache.spark.sql.SparkSession

case class TestPerson(name: String, age: Long, salary: Double)

val spark = SparkSession.builder().appName("List to Dataset").master("local[*]").getOrCreate()

val tom = TestPerson("Tom Hanks", 37, 35.5)
val sam = TestPerson("Sam Smith", 40, 40.5)

// mutable.MutableList[TestPerson]() is not required; I used the approach below, which is cleaner
val PersonList = List(tom, sam)
import spark.implicits._
PersonList.toDS().show
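To make the earlier note about the case class placement concrete, here is a minimal sketch of that layout for a standalone application; the object name and main method are illustrative, not part of the original answers:
import org.apache.spark.sql.SparkSession

// The case class lives at the top level of the file, outside the object that
// uses it, so Spark can derive an implicit Encoder for TestPerson.
case class TestPerson(name: String, age: Long, salary: Double)

object ListToDatasetApp { // hypothetical object name, for illustration only
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("List to Dataset").master("local[*]").getOrCreate()
    import spark.implicits._

    val personList = List(TestPerson("Tom Hanks", 37, 35.5), TestPerson("Sam Smith", 40, 40.5))
    personList.toDS().show()

    spark.stop()
  }
}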