
How is the Spark select-explode idiom implemented?

Assume we have a DataFrame with a string column, col1, and an array column, col2. I was wondering what happens behind the scenes in the Spark operation:

from pyspark.sql.functions import explode

df.select('col1', explode('col2'))

It seems that select takes a sequence of Column objects as input, and explode returns a Column, so the types match. But the column returned by explode('col2') is logically of a different length than col1, so I was wondering how select knows to "sync" them when constructing its output DataFrame. I looked at the Column class for clues but couldn't really find anything.

hillel asked Jun 26 '16

1 Answer

The answer is simple: there is no such data structure as a Column. While Spark SQL uses columnar storage for caching and can leverage data layout for some low-level operations, columns are just descriptions of data and transformations, not data containers. So, simplifying things a bit, explode is just another flatMap on the Dataset[Row].
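To make that concrete, here is a minimal pure-Python sketch of the flatMap semantics described above; it is not Spark's actual implementation, and the helper name explode_rows is made up for illustration. Each input row yields one output row per array element, with the other column values repeated, which is why no "syncing" of column lengths is ever needed:

```python
def explode_rows(rows, array_col):
    # flatMap-style expansion: one output row per element of the array column,
    # copying the remaining columns into each output row.
    for row in rows:
        for element in row[array_col]:
            out = dict(row)
            out[array_col] = element
            yield out

rows = [
    {"col1": "a", "col2": [1, 2]},
    {"col1": "b", "col2": [3]},
]
result = list(explode_rows(rows, "col2"))
# col1 is simply repeated alongside each exploded element:
# [{'col1': 'a', 'col2': 1}, {'col1': 'a', 'col2': 2}, {'col1': 'b', 'col2': 3}]
```

Because the transformation is row-wise, the "longer" exploded column never exists on its own; select just describes which fields of each generated row to keep.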

zero323 answered Oct 14 '22