What's the difference between explode function and explode operator?
What is the difference between explode and posexplode functions in Hive? Both explode and posexplode are user-defined table-generating functions (UDTFs). UDTFs operate on single rows and produce multiple rows as output. There are two flavors of explode: one takes an array and the other takes a map. posexplode additionally returns the position of each element within the array.
flatMap can perform much better than explode, as flatMap requires much less data shuffling. When processing big data (>5 GB), the performance difference becomes evident.
org.apache.spark.sql.functions.explode
The explode function creates a new row for each element in the given array or map column (in a DataFrame).
val signals: DataFrame = spark.read.json(signalsJson)
signals.withColumn("element", explode($"data.datapayload"))
explode creates a Column.
See functions object and the example in How to unwind array in DataFrame (from JSON)?
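As a self-contained illustration of the explode function, here is a minimal sketch. It assumes a local SparkSession and made-up sample data; the object name, the column names (id, words, word) and the rows are hypothetical and not taken from the original question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.explode

object ExplodeFunctionSketch {
  // Builds a small DataFrame with an array column and explodes it.
  def explodeWords(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Hypothetical sample data: an id plus an array of words per row.
    val sentences = Seq(
      (1, Seq("hello", "world")),
      (2, Seq("spark"))
    ).toDF("id", "words")

    // explode only builds a Column expression; withColumn applies it,
    // producing one output row per element of the array.
    sentences.withColumn("word", explode($"words"))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; the app name is arbitrary.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explode-function-sketch")
      .getOrCreate()

    explodeWords(spark).show()
    spark.stop()
  }
}
```

The other columns of each input row (here id and words) are repeated on every generated row.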
Dataset<Row> explode / flatMap operator (method)
The explode operator is almost the explode function.
From the scaladoc:
explode returns a new Dataset where a single column has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. All columns of the input row are implicitly joined with each value that is output by the function.
ds.flatMap(_.words.split(" "))
Please note that (again quoting the scaladoc):
Deprecated (Since version 2.0.0) use flatMap() or select() with functions.explode() instead
See Dataset API and the example in How to split multi-value column into separate rows using typed Dataset?
Although the explode operator is deprecated (so the main question can be rephrased as the difference between the explode function and the flatMap operator), the difference is that the former is a function while the latter is an operator. They have different signatures, but can give the same results. That often leads to discussions about which is better, and it usually boils down to personal preference or coding style.
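To make the "different signatures, same results" point concrete, here is a hedged sketch that runs both routes over the same data. The Line case class, the sample strings and the object name are assumptions made up for this illustration; split and explode are the standard functions from org.apache.spark.sql.functions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, split}

// Hypothetical record type used only for this illustration.
final case class Line(text: String)

object SameResultsSketch {
  // Hypothetical sample data shared by both routes.
  private val sample = Seq(Line("hello world"), Line("spark"))

  // Untyped route: select() with functions.explode() over a split column.
  def viaExplode(spark: SparkSession): Seq[String] = {
    import spark.implicits._
    sample.toDS()
      .select(explode(split($"text", " ")).as("word"))
      .as[String]
      .collect()
      .toSeq
  }

  // Typed route: flatMap with an ordinary Scala function.
  def viaFlatMap(spark: SparkSession): Seq[String] = {
    import spark.implicits._
    sample.toDS().flatMap(_.text.split(" ")).collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; the app name is arbitrary.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("same-results-sketch")
      .getOrCreate()

    // Sorting avoids relying on any particular row order.
    println(viaExplode(spark).sorted == viaFlatMap(spark).sorted)
    spark.stop()
  }
}
```

The explode route stays within Column expressions, while the flatMap route takes an arbitrary Scala function over the typed records. That is the signature difference described above.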
One could also say that flatMap (i.e. explode operator) is more Scala-ish given how ubiquitous flatMap is in Scala programming (mainly hidden behind for-comprehension).
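The remark about for-comprehensions can be illustrated with plain Scala collections, no Spark required. This is a small sketch with made-up sample data; the object and value names are hypothetical.

```scala
object ForComprehensionSketch {
  // Hypothetical sample data.
  val lines: List[String] = List("hello world", "spark")

  // Explicit flatMap: each line expands to zero or more words.
  val viaFlatMap: List[String] =
    lines.flatMap(line => line.split(" ").toList)

  // The same computation as a for-comprehension; the compiler rewrites
  // the two generators into a flatMap followed by a map.
  val viaFor: List[String] =
    for {
      line <- lines
      word <- line.split(" ").toList
    } yield word

  def main(args: Array[String]): Unit = {
    println(viaFlatMap) // List(hello, world, spark)
    println(viaFor)     // List(hello, world, spark)
  }
}
```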