
How is the Spark select-explode idiom implemented?

Assume we have a DataFrame with a string column, col1, and an array column, col2. I was wondering what happens behind the scenes in the Spark operation:

from pyspark.sql.functions import explode

df.select('col1', explode('col2'))

It seems that select takes a sequence of Column objects as input, and explode returns a Column, so the types match. But the column returned by explode('col2') is logically of a different length than col1, so I was wondering how select knows to "sync" them when constructing its output DataFrame. I looked at the Column class for clues but couldn't really find anything.

hillel asked Jun 26 '16

1 Answer

The answer is simple: there is no such data structure as a Column. While Spark SQL uses columnar storage for caching and can leverage data layout for some low-level operations, columns are just descriptions of data and transformations, not data containers. So, simplifying things a bit, explode is just another flatMap on the Dataset[Row].
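To make that concrete, here is a minimal pure-Python sketch of the flatMap semantics described above; it is not Spark's actual implementation, and the helper name explode_rows is made up for illustration. Each input row yields one output row per array element, with the other column values repeated, which is why no "syncing" of column lengths is ever needed:

```python
def explode_rows(rows, array_col):
    # flatMap-style expansion: one output row per element of the array column,
    # copying the remaining columns into each output row.
    for row in rows:
        for element in row[array_col]:
            out = dict(row)
            out[array_col] = element
            yield out

rows = [
    {"col1": "a", "col2": [1, 2]},
    {"col1": "b", "col2": [3]},
]
result = list(explode_rows(rows, "col2"))
# col1 is simply repeated alongside each exploded element:
# [{'col1': 'a', 'col2': 1}, {'col1': 'a', 'col2': 2}, {'col1': 'b', 'col2': 3}]
```

Because the transformation is row-wise, the "longer" exploded column never exists on its own; select just describes which fields of each generated row to keep.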

zero323 answered Oct 14 '22