 

Can Spark read data directly into a nested case class?

Say you have a CSV with three columns: item, username, and userid. It is a fairly simple matter to use Spark's Dataset API to read this in:

case class Flat(item: String, username: String, userid: String)
val ds = sparkSession.read.csv("path/to/data").toDF("item", "username", "userid").as[Flat]

Then ds will be of type Dataset[Flat].

But suppose you would prefer that your data have the form Dataset[Nested] where Nested is given by:

case class User(name: String, id: String)
case class Nested(item: String, user: User)

One way to do it is to read the data into a Dataset[Flat] and then apply a map to transform it into a Dataset[Nested], but in practice the Flat case class often isn't needed for anything else and it makes the code unnecessarily verbose. Is there any way to skip the middleman and directly construct a Dataset[Nested]?
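For reference, the two-step version looks roughly like this (a sketch, assuming a sparkSession in scope and the column names above):

import sparkSession.implicits._

// Read into the flat representation first...
val flat = sparkSession.read.csv("path/to/data").toDF("item", "username", "userid").as[Flat]
// ...then map each row into the nested shape.
val nested = flat.map(f => Nested(f.item, User(f.username, f.userid)))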

asked Dec 18 '17 by Paul Siegel

People also ask

How do you use a case class in Spark?

The Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame. The case class defines the schema of the table. The names of the arguments to the case class are read using reflection and they become the names of the columns.
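A minimal sketch of that reflection-based conversion (assuming a SparkSession named spark and a hypothetical Person case class):

import spark.implicits._

case class Person(name: String, age: Int)

// Column names and types are inferred from the case class fields via reflection.
val rdd = spark.sparkContext.parallelize(Seq(Person("Alice", 29), Person("Bob", 31)))
val peopleDF = rdd.toDF()   // columns: name (string), age (int)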

Is Spark schema-on-read?

Spark supports specifying a schema on read via the read.schema(...) method. A Spark schema defines the structure of the data (column names, data types, nested columns, nullability, etc.), and when it is specified while reading a file, the DataFrame interprets and reads the file with that schema; once the DataFrame is created, it becomes the structure of the DataFrame.
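For example, a schema can be passed explicitly when reading (a sketch, using the question's columns, an assumed path, and a SparkSession named spark):

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("item", StringType, nullable = true),
  StructField("username", StringType, nullable = true),
  StructField("userid", StringType, nullable = true)
))

// The file is parsed with this structure instead of an inferred one.
val df = spark.read.schema(schema).csv("path/to/data")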

How does Spark transfer data?

Spark ingests the CSV file in a distributed way. The file must be on a shared drive, a distributed file system, or shared via a file-sharing mechanism like Dropbox, Box, or Nextcloud/Owncloud. In this context, a partition is a dedicated area in a worker's memory.
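To see that split, the number of partitions of the ingested data can be inspected (a sketch, assuming a SparkSession named spark and an assumed path):

val df = spark.read.csv("path/to/data")

// Each partition is processed independently by a worker.
println(df.rdd.getNumPartitions)

// Repartition to change how the rows are distributed across workers.
val redistributed = df.repartition(8)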

Can we use Datasets in PySpark?

PySpark allows you to work with Resilient Distributed Datasets (RDDs) in Python through a library called Py4j.


1 Answer

Is there any way to skip the middleman and directly construct a Dataset[Nested]?

There is not. Datasets are matched by structure and names; the right names alone are not enough, and the data has to be reshaped.

If you prefer to skip the Flat definition, just use the dynamic API:

import org.apache.spark.sql.functions._
import sparkSession.implicits._   // for the $"col" syntax

ds.select($"item", struct($"username" as "name", $"userid" as "id") as "user").as[Nested]

as[Flat] doesn't really type-check anyway, so you don't lose anything.
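Putting it together, a minimal end-to-end sketch of this approach (assuming a sparkSession in scope and the path from the question):

import org.apache.spark.sql.functions._
import sparkSession.implicits._

case class User(name: String, id: String)
case class Nested(item: String, user: User)

// Rename the raw CSV columns, then build the nested struct in one select.
val nested = sparkSession.read.csv("path/to/data")
  .toDF("item", "username", "userid")
  .select($"item", struct($"username" as "name", $"userid" as "id") as "user")
  .as[Nested]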

answered Nov 04 '22 by Alper t. Turker