Assume we have a DataFrame with a string column, col1, and an array column, col2. I was wondering what happens behind the scenes in the Spark operation:

df.select('col1', explode('col2'))
It seems that select takes a sequence of Column objects as input, and explode returns a Column, so the types match. But the column returned by explode('col2') is logically of a different length than col1, so I was wondering how select knows to "sync" them when constructing its output DataFrame. I tried looking at the Column class for clues but couldn't really find anything.
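For illustration, here is a minimal reproduction of the behavior in question (a sketch assuming a local SparkSession; the data and names are made up):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.getOrCreate()

# One input row: a string paired with a three-element array
df = spark.createDataFrame([('a', [1, 2, 3])], ['col1', 'col2'])

# The single col1 value is repeated once per array element,
# so the result has three rows, not one
df.select('col1', explode('col2')).show()
# +----+---+
# |col1|col|
# +----+---+
# |   a|  1|
# |   a|  2|
# |   a|  3|
# +----+---+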
The answer is simple - there is no such data structure as a Column. While Spark SQL uses columnar storage for caching and can leverage data layout for some low-level operations, columns are just descriptions of data and transformations, not data containers. So, simplifying things a bit, explode is yet another flatMap on the Dataset[Row].
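To make the flatMap analogy concrete, here is a rough RDD-level equivalent of the select above (a sketch of the semantics only, not the actual plan Spark produces):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('a', [1, 2, 3])], ['col1', 'col2'])

# Semantically what explode does: flat-map each input row into
# one output row per array element, repeating the col1 value
exploded = df.rdd.flatMap(
    lambda row: [(row.col1, v) for v in row.col2]
).toDF(['col1', 'col'])

exploded.show()  # same three rows as df.select('col1', explode('col2'))

Each output row pairs the original col1 value with one element of the array, so there is nothing for select to "sync": the expansion happens row by row as the query is evaluated.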