
How to convert Row of a Scala DataFrame into case class most efficiently?

Once I have obtained some Row in Spark, either from a DataFrame or from Catalyst, I want to convert it to a case class in my code. This can be done by pattern matching:

someRow match { case Row(a: Long, b: String, c: Double) => myCaseClass(a, b, c) }

But it becomes ugly when the row has a huge number of columns, say a dozen Doubles, some Booleans, and even the occasional null.

I would just like to be able to (sorry) cast the Row to myCaseClass. Is that possible, or do I already have the most economical syntax?

asked Jan 27 '15 by arivero


People also ask

How do I convert a Row to a DataFrame in Spark Scala?

In the case of one row, you can run: val dfFromArray = sparkContext.parallelize(Seq(row)).map(row => (row. …
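The snippet above is cut off mid-expression. A minimal, self-contained sketch of the same idea, assuming a SparkSession named spark and an illustrative two-column row (names and types are made up for the example):

import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder.appName("row-to-df").master("local[*]").getOrCreate()
import spark.implicits._

// One Row with a known layout; the column names and types are illustrative.
val row = Row("alice", 42)

val dfFromArray = spark.sparkContext
  .parallelize(Seq(row))
  .map(r => (r.getString(0), r.getInt(1))) // pull typed values out of the generic Row
  .toDF("name", "age")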

How do we convert a DataFrame to a Dataset?

You can call df.as[SomeCaseClass] to convert the DataFrame to a Dataset. You can also work with tuples when converting a DataFrame to a Dataset, without using a case class.
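A short sketch of both variants; the case class, data, and column names are assumptions for the example, and the DataFrame's column names and types must line up with the target type:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("df-to-ds").master("local[*]").getOrCreate()
import spark.implicits._ // supplies Encoders for case classes and tuples

case class Person(name: String, age: Long)

val df = Seq(("alice", 30L), ("bob", 25L)).toDF("name", "age")

val people = df.as[Person]          // Dataset[Person]; columns must match the fields
val pairs  = df.as[(String, Long)]  // the same conversion using a tuple instead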

What is Row RDD in spark?

Spark's createDataFrame() has another signature that takes an RDD[Row] and a schema for the column names as arguments. To use it, we first need to convert our rdd object from RDD[T] to RDD[Row]. To define a schema, we use StructType, which takes an array of StructField.
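A sketch of that signature in use, with an illustrative two-column schema (the names, types, and nullability are assumptions):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder.appName("rdd-to-df").master("local[*]").getOrCreate()

// Each Row must line up with the schema declared below.
val rowRdd = spark.sparkContext.parallelize(Seq(Row("alice", 30), Row("bob", 25)))

val schema = StructType(Array(
  StructField("name", StringType, nullable = true),
  StructField("age",  IntegerType, nullable = true)
))

val df = spark.createDataFrame(rowRdd, schema)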

What is DataFrame DSL?

DataFrame is a data abstraction and a domain-specific language (DSL) for working with structured and semi-structured data, i.e. datasets for which you can specify a schema. A DataFrame is a collection of rows with a schema that is the result of executing a structured query.
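For illustration, here is a filter you might otherwise write as a SQL string, expressed in the DataFrame DSL (the data and column names are made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("dsl-example").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("alice", 30), ("bob", 12)).toDF("name", "age")

// A structured query built from column expressions rather than a SQL string.
val adults = df.filter(col("age") >= 18).select(col("name"))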


2 Answers

DataFrame is simply a type alias for Dataset[Row]. Operations on it are also referred to as “untyped transformations”, in contrast to the “typed transformations” that come with strongly typed Scala/Java Datasets.

The conversion from Dataset[Row] to Dataset[Person] is very simple in Spark:

val DFtoProcess = sqlContext.sql("SELECT * FROM peoples WHERE name='test'")

At this point, Spark converts your data into a DataFrame = Dataset[Row], a collection of generic Row objects, since it does not know the exact type.

// Create an encoder for the Java class (in this example, Person is a Java bean class).
// For a Scala case class you can call .as[Person] without the encoder argument,
// provided spark.implicits._ is in scope.
val personEncoder = Encoders.bean(classOf[Person])

val DStoProcess = DFtoProcess.as[Person](personEncoder)

Now, Spark converts the Dataset[Row] into a Dataset[Person], holding type-specific Scala/Java JVM objects as dictated by the class Person.
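For the Scala case class route the comment above mentions, the encoder usually comes in implicitly, so no explicit Encoders call is needed. A sketch, assuming a SparkSession named spark and that DFtoProcess's columns match the case class fields (Person here is an illustrative case class, not the Java bean above):

import spark.implicits._ // implicit Encoder[Person] is derived for the case class

case class Person(name: String, age: Long)

// Column names and types must line up with Person's fields.
val DStoProcess = DFtoProcess.as[Person]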

Please refer to the link below, provided by Databricks, for further details:

https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

answered Sep 28 '22 by Rahul


As far as I know you cannot cast a Row to a case class, but I sometimes choose to access the row fields directly, like

map(row => myCaseClass(row.getLong(0), row.getString(1), row.getDouble(2)))

I find this to be easier, especially if the case class constructor only needs some of the fields from the row.
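A variant of the same approach that looks fields up by name instead of position, which survives column reordering; the case class and column names here are illustrative. Mapping a DataFrame with this function then needs an encoder in scope, e.g. import spark.implicits._.

import org.apache.spark.sql.Row

case class MyCaseClass(a: Long, b: String, c: Double)

// getAs[T]("col") resolves the field by name; it still throws if the
// name is missing or the runtime type does not match.
val toMyCaseClass = (row: Row) => MyCaseClass(
  row.getAs[Long]("a"),
  row.getAs[String]("b"),
  row.getAs[Double]("c")
)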

answered Sep 28 '22 by Glennie Helles Sindholt