What's the difference between `explode` function and `explode` operator?


2 Answers

flatMap performs much better than explode because flatMap requires much less data shuffling. If you are processing big data (>5 GB), the performance difference becomes evident.

answered Oct 01 '22 06:10

Asid

`org.apache.spark.sql.functions.explode`

The `explode` function creates a new row for each element in the given array or map column (in a DataFrame).

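import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.explode
import spark.implicits._ // for the $"..." column syntax
val signals: DataFrame = spark.read.json(signalsJson)
signals.withColumn("element", explode($"data.datapayload"))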

`explode` returns a `Column`, so it can be used wherever a column expression is accepted, such as `select` or `withColumn`.

See the `functions` object and the example in "How to unwind array in DataFrame (from JSON)?"
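To make the behaviour concrete, here is a minimal sketch that runs against a local `SparkSession`. The in-memory rows, the column names `id` and `payload`, and the object name are assumptions made for illustration; they stand in for the JSON input above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

object ExplodeFunctionSketch {
  // Returns (id, element) pairs after exploding the hypothetical "payload" array column.
  def explodedRows(spark: SparkSession): Seq[(Int, String)] = {
    import spark.implicits._ // enables toDF, the $"..." syntax and tuple encoders

    // In-memory stand-in for the JSON input; id 3 carries an empty array.
    val signals = Seq(
      (1, Seq("a", "b")),
      (2, Seq("c")),
      (3, Seq.empty[String])
    ).toDF("id", "payload")

    signals
      .withColumn("element", explode($"payload")) // one output row per array element
      .select($"id", $"element")
      .as[(Int, String)]
      .collect()
      .toSeq
      .sorted // sort locally so the result does not depend on partition order
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explode-function-sketch")
      .getOrCreate()
    try explodedRows(spark).foreach(println)
    finally spark.stop()
  }
}
```

Note that the row with the empty array produces no output rows at all. If such rows should be kept (with a null element), `functions.explode_outer` is the variant to look at.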

`Dataset<Row> explode` / `flatMap` operator (method)

The `explode` operator does almost the same thing as the `explode` function.

From the scaladoc:

explode returns a new Dataset where a single column has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. All columns of the input row are implicitly joined with each value that is output by the function.

ds.flatMap(_.words.split(" "))

Please note that (again quoting the scaladoc):

Deprecated (Since version 2.0.0) use flatMap() or select() with functions.explode() instead

See the Dataset API and the example in "How to split multi-value column into separate rows using typed Dataset?"
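The `flatMap` line quoted from the scaladoc needs a typed Dataset whose elements have a `words` field. A minimal sketch, assuming a hypothetical `Sentence` case class and made-up sample data, could look like this:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type providing the `words` field used in the scaladoc snippet.
final case class Sentence(id: Int, words: String)

object FlatMapOperatorSketch {
  // Splits every sentence into words with the typed flatMap operator.
  def splitWords(spark: SparkSession): Seq[String] = {
    import spark.implicits._ // encoders for case classes and String

    val ds: Dataset[Sentence] =
      Seq(Sentence(1, "hello spark"), Sentence(2, "scala")).toDS()

    // The function returns zero or more values per input element.
    val words: Dataset[String] = ds.flatMap(_.words.split(" "))

    words.collect().toSeq.sorted // sort locally for a stable result
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatmap-operator-sketch")
      .getOrCreate()
    try splitWords(spark).foreach(println)
    finally spark.stop()
  }
}
```

Unlike the deprecated `explode` operator described in the scaladoc quote, `flatMap` returns only what the function emits. The `id` field is gone from the result unless the function carries it along explicitly.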

The `explode` operator is deprecated, so the question effectively becomes: what is the difference between the `explode` function and the `flatMap` operator? The former is a function that builds a `Column` expression for use in `select` or `withColumn`, while the latter is a typed `Dataset` operator that takes a Scala function. They have different signatures but can give the same results. That often leads to discussions about which is better, which usually boils down to personal preference or coding style.
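As a rough illustration of "different signatures, same results", the sketch below computes the same (id, word) pairs both ways. The sample data, the column names and the object name are assumptions; `split` is the built-in from `org.apache.spark.sql.functions`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, split}

object SameResultSketch {
  // Returns the (id, word) pairs computed via the explode function and via flatMap.
  def compare(spark: SparkSession): (Seq[(Int, String)], Seq[(Int, String)]) = {
    import spark.implicits._

    val ds = Seq((1, "hello spark"), (2, "scala")).toDS()
    val df = ds.toDF("id", "words")

    // Untyped, column-based: split builds an array column, explode unrolls it.
    val viaFunction = df
      .select($"id", explode(split($"words", " ")).as("word"))
      .as[(Int, String)]
      .collect()
      .toSeq
      .sorted

    // Typed, function-based: the id has to be carried along by hand.
    val viaOperator = ds
      .flatMap { row =>
        val (id, words) = row
        words.split(" ").toSeq.map(word => (id, word))
      }
      .collect()
      .toSeq
      .sorted

    (viaFunction, viaOperator)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("same-result-sketch")
      .getOrCreate()
    try {
      val (viaFunction, viaOperator) = compare(spark)
      println(viaFunction == viaOperator)
    } finally spark.stop()
  }
}
```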

One could also say that `flatMap` (i.e. the `explode` operator) is more Scala-ish, given how ubiquitous `flatMap` is in Scala programming (mostly hidden behind for-comprehensions).
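For readers less familiar with that last point, here is a small plain-Scala sketch (no Spark involved, sample data made up) showing that a for-comprehension with two generators does the same job as an explicit `flatMap`:

```scala
object ForComprehensionSketch {
  val sentences: List[String] = List("hello spark", "scala")

  // Explicit flatMap: each sentence yields zero or more words.
  val viaFlatMap: List[String] =
    sentences.flatMap(sentence => sentence.split(" ").toList)

  // The same logic as a for-comprehension; the compiler rewrites the
  // two generators into a flatMap call followed by a map call.
  val viaFor: List[String] =
    for {
      sentence <- sentences
      word     <- sentence.split(" ").toList
    } yield word

  def main(args: Array[String]): Unit = {
    println(viaFlatMap)
    println(viaFor)
  }
}
```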

answered Oct 01 '22 07:10

Jacek Laskowski
