Say you have a CSV with three columns: item, username, and userid. It is a fairly simple matter to use Spark's Dataset API to read this in:
case class Flat(item: String, username: String, userid: String)

import sparkSession.implicits._ // provides the Encoder[Flat] needed by .as[Flat]
val ds = sparkSession.read.csv("path/to/data").toDF("item", "username", "userid").as[Flat]
Then ds will be of type Dataset[Flat].
But suppose you would prefer that your data have the form Dataset[Nested], where Nested is given by:
case class User(name: String, id: String)
case class Nested(item: String, user: User)
One way to do it is to read the data into a Dataset[Flat] and then apply a map to transform it into a Dataset[Nested], but in practice the Flat case class often isn't needed for anything else, and it makes the code unnecessarily verbose. Is there any way to skip the middleman and directly construct a Dataset[Nested]?
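For reference, the two-step version looks roughly like this (a sketch, reusing the ds: Dataset[Flat] built above; the encoders for Nested and User come from sparkSession.implicits._):

val nested = ds.map(f => Nested(f.item, User(f.username, f.userid))) // Dataset[Nested]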
Is there any way to skip the middleman and directly construct a Dataset[Nested]?
There is not - Datasets are matched by structure and names. You cannot just remap the names; the data itself has to be reshaped.
If you prefer to skip the Flat definition, just use the dynamic API:
import org.apache.spark.sql.functions._
ds.select($"item", struct($"name", $"id") as "user").as[Nested]
as[Flat] doesn't give you any real compile-time checking anyway, so you don't lose anything.
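Putting it together, a minimal end-to-end sketch (assuming the column names from the question and sparkSession.implicits._ in scope for the $ syntax and the Nested encoder):

import org.apache.spark.sql.functions._
import sparkSession.implicits._

val nested = sparkSession.read.csv("path/to/data")
  .toDF("item", "username", "userid")
  .select($"item", struct($"username" as "name", $"userid" as "id") as "user")
  .as[Nested] // Dataset[Nested] built directly, no Flat case class needed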