Once I have a Row in Spark (whether from a DataFrame or from Catalyst), I want to convert it to a case class in my code. This can be done by pattern matching:
someRow match { case Row(a: Long, b: String, c: Double) => myCaseClass(a, b, c) }
But this becomes ugly when the row has a large number of columns, say a dozen Doubles, some Booleans, and even the occasional null.
I would just like to be able to (sorry) cast a Row to myCaseClass. Is that possible, or have I already got the most economical syntax?
If you only have a single row, you can parallelize it into an RDD, map its fields out, and build a DataFrame from it, as in the sketch below.
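A minimal sketch of that, assuming a SparkSession named spark, an existing Row named row with (Long, String, Double) fields, and placeholder column names:

import spark.implicits._

val dfFromRow = spark.sparkContext
  .parallelize(Seq(row))                                      // RDD[Row] with a single element
  .map(r => (r.getLong(0), r.getString(1), r.getDouble(2)))   // Row -> tuple
  .toDF("a", "b", "c")                                        // tuple -> DataFrame with named columns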
You can call df.as[SomeCaseClass] to convert the DataFrame to a Dataset. You can also work with tuples when converting a DataFrame to a Dataset, without using a case class.
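As a minimal sketch, assuming a SparkSession named spark and a DataFrame df whose columns are id: Long and name: String (placeholder names and types):

import spark.implicits._

case class SomeCaseClass(id: Long, name: String)

// For a case class, columns are matched to fields by name
val ds = df.as[SomeCaseClass]

// For a tuple, columns are mapped by position, so no case class is needed
val dsTuple = df.as[(Long, String)]

Note that as only changes the typed view of the data; no job is triggered until an action runs.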
Using the RDD Row type, RDD[Row], to build a DataFrame: Spark's createDataFrame() has another signature that takes an RDD[Row] and a schema for the column names as arguments. To use it, we first need to convert our “rdd” object from RDD[T] to RDD[Row]. To define the schema, we use a StructType that takes an array of StructField, as in the sketch below.
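A minimal sketch, assuming a SparkSession named spark and an existing rdd of some hypothetical type with fields a: Long, b: String, c: Double (the field and column names here are placeholders):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, LongType, StringType, DoubleType}

// RDD[T] -> RDD[Row]
val rowRdd = rdd.map(r => Row(r.a, r.b, r.c))

// The schema is a StructType wrapping an array of StructField
val schema = StructType(Array(
  StructField("a", LongType, nullable = false),
  StructField("b", StringType, nullable = true),
  StructField("c", DoubleType, nullable = true)
))

val df = spark.createDataFrame(rowRdd, schema)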
A DataFrame is a data abstraction, or a domain-specific language (DSL), for working with structured and semi-structured data, i.e. datasets for which you can specify a schema. A DataFrame is a collection of rows with a schema and is the result of executing a structured query.
DataFrame is simply a type alias for Dataset[Row]. Its operations are also referred to as “untyped transformations”, in contrast to the “typed transformations” that come with strongly typed Scala/Java Datasets.
The conversion from Dataset[Row] to Dataset[Person] is very simple in Spark.
val DFtoProcess = spark.sql("SELECT * FROM peoples WHERE name = 'test'")
At this point, Spark converts your data into a DataFrame = Dataset[Row], a collection of generic Row objects, since it does not know the exact type.
import org.apache.spark.sql.Encoders

// Create an Encoder for the Java class (in my example, Person is a Java class).
// For a Scala case class you do not need the explicit bean encoder; see the sketch below.
val personEncoder = Encoders.bean(classOf[Person])
val DStoProcess = DFtoProcess.as[Person](personEncoder)
Now Spark converts the Dataset[Row] -> Dataset[Person], a collection of type-specific Scala/Java JVM objects, as dictated by the class Person.
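For a Scala case class the explicit bean encoder is not needed. A minimal sketch of the equivalent, assuming a SparkSession named spark and treating Person as a case class with placeholder fields:

import spark.implicits._

case class Person(name: String, age: Int)   // hypothetical fields

val DStoProcess = DFtoProcess.as[Person]    // the encoder is derived implicitly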
Please refer to the link below, provided by Databricks, for further details:
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
As far as I know you cannot cast a Row to a case class, but I sometimes choose to access the row fields directly, like
map(row => myCaseClass(row.getLong(0), row.getString(1), row.getDouble(2)))
I find this to be easier, especially if the case class constructor only needs some of the fields from the row.
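If some columns can be null, as in the question, one option is to model them as Option in the case class and check isNullAt. A minimal sketch, assuming a hypothetical myCaseClass(a: Long, b: String, c: Option[Double]) and a DataFrame df:

val converted = df.rdd.map { row =>
  myCaseClass(
    row.getLong(0),
    row.getString(1),
    if (row.isNullAt(2)) None else Some(row.getDouble(2))   // nullable Double -> Option[Double]
  )
}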