 

How do I collect a single column in Spark?

I would like to perform an action on a single column. Unfortunately, after I transform that column, it is no longer part of the DataFrame it came from but a Column object, so it cannot be collected.

Here is an example:

from pyspark.sql import Row

df = sqlContext.createDataFrame([Row(array=[1, 2, 3])])
df['array'].collect()

This produces the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'Column' object is not callable

How can I use the collect() function on a single column?

Michal asked Feb 19 '16


1 Answer

Spark >= 2.0

Starting from Spark 2.0.0, DataFrame no longer exposes RDD methods such as flatMap directly, so you have to go through .rdd explicitly:

df.select("array").rdd.flatMap(lambda x: x).collect()

Spark < 2.0

Just select and flatMap:

df.select("array").flatMap(lambda x: x).collect()
## [[1, 2, 3]] 
zero323 answered Sep 21 '22