 

Can Spark read data directly into a nested case class?

Say you have a CSV with three columns: item, username, and userid. It is a fairly simple matter to use Spark's Dataset API to read this in:

case class Flat(item: String, username: String, userid: String)
val ds = sparkSession.read.csv("path/to/data").toDF("item", "username", "userid").as[Flat]

Then ds will be of type Dataset[Flat].

But suppose you would prefer that your data have the form Dataset[Nested] where Nested is given by:

case class User(name: String, id: String)
case class Nested(item: String, user: User)

One way to do it is to read the data into a Dataset[Flat] and then apply a map to transform it into a Dataset[Nested], but in practice the Flat case class often isn't needed for anything else and it makes the code unnecessarily verbose. Is there any way to skip the middleman and directly construct a Dataset[Nested]?
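For reference, the two-step version looks roughly like this (a sketch, assuming a sparkSession in scope and the column names above):

import sparkSession.implicits._

// Read into the flat representation first...
val flat = sparkSession.read.csv("path/to/data").toDF("item", "username", "userid").as[Flat]
// ...then map each row into the nested shape.
val nested = flat.map(f => Nested(f.item, User(f.username, f.userid)))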

asked Dec 18 '17 by Paul Siegel

People also ask

How do you use a case class in Spark?

The Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame. The case class defines the schema of the table. The names of the arguments to the case class are read using reflection and they become the names of the columns.
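A minimal sketch of that reflection-based conversion (assuming a SparkSession named spark and a hypothetical Person case class):

import spark.implicits._

case class Person(name: String, age: Int)

// Column names and types are inferred from the case class fields via reflection.
val rdd = spark.sparkContext.parallelize(Seq(Person("Alice", 29), Person("Bob", 31)))
val peopleDF = rdd.toDF()   // columns: name (string), age (int)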

Is Spark schema-on-read?

Spark supports specifying a schema on read via the read.schema(...) method. A Spark schema defines the structure of the data (column names, data types, nested columns, nullability, etc.), and when it is specified while reading a file, the DataFrame interprets and reads the file with that schema; once the DataFrame is created, it becomes the structure of the DataFrame.
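For example, a schema can be passed explicitly when reading (a sketch, using the question's columns, an assumed path, and a SparkSession named spark):

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("item", StringType, nullable = true),
  StructField("username", StringType, nullable = true),
  StructField("userid", StringType, nullable = true)
))

// The file is parsed with this structure instead of an inferred one.
val df = spark.read.schema(schema).csv("path/to/data")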

How does Spark transfer data?

Spark ingests the CSV file in a distributed way. The file must be on a shared drive, a distributed file system, or shared via a file-sharing mechanism like Dropbox, Box, or Nextcloud/Owncloud. In this context, a partition is a dedicated area in a worker's memory.
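To see that split, the number of partitions of the ingested data can be inspected (a sketch, assuming a SparkSession named spark and an assumed path):

val df = spark.read.csv("path/to/data")

// Each partition is processed independently by a worker.
println(df.rdd.getNumPartitions)

// Repartition to change how the rows are distributed across workers.
val redistributed = df.repartition(8)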

Can we use Datasets in PySpark?

PySpark allows you to work with Resilient Distributed Datasets (RDDs) in Python through a library called Py4j.


1 Answer

Is there any way to skip the middleman and directly construct a Dataset[Nested]?

There is not. Datasets are matched by structure and names; the right names alone are not enough, and the data has to be reshaped.

If you prefer to skip the Flat definition, just use the dynamic API:

import org.apache.spark.sql.functions._
import sparkSession.implicits._   // for the $"col" syntax

ds.select($"item", struct($"username" as "name", $"userid" as "id") as "user").as[Nested]

as[Flat] doesn't really type-check anyway, so you don't lose anything.
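Putting it together, a minimal end-to-end sketch of this approach (assuming a sparkSession in scope and the path from the question):

import org.apache.spark.sql.functions._
import sparkSession.implicits._

case class User(name: String, id: String)
case class Nested(item: String, user: User)

// Rename the raw CSV columns, then build the nested struct in one select.
val nested = sparkSession.read.csv("path/to/data")
  .toDF("item", "username", "userid")
  .select($"item", struct($"username" as "name", $"userid" as "id") as "user")
  .as[Nested]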

answered Nov 04 '22 by Alper t. Turker