
Pyspark dataframe column to list

I am trying to extract the values of one DataFrame column into a Python list.

+------+----------+------------+
|sno_id|updt_dt   |process_flag|
+------+----------+------------+
| 123  |01-01-2020|     Y      |
+------+----------+------------+
| 234  |01-01-2020|     Y      |
+------+----------+------------+
| 512  |01-01-2020|     Y      |
+------+----------+------------+
| 111  |01-01-2020|     Y      |
+------+----------+------------+

The output should be the list of sno_id values: ['123','234','512','111']. I then need to iterate over the list and run some logic on each value. I am currently using HiveWarehouseSession to fetch data from a Hive table into a DataFrame with hive.executeQuery(query).

Cavalez asked Feb 25 '20 19:02


1 Answer

This is straightforward: first collect the DataFrame, which returns a list of Row objects.

row_list = df.select('sno_id').collect()

Then iterate over the Row objects to turn the column into a list:

sno_id_array = [ row.sno_id for row in row_list]

sno_id_array 
['123','234','512','111']
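
The two steps above can be wrapped in a small helper and followed by the loop the question asks for. This is a minimal sketch, not the answerer's code: `column_to_list` and `demo` are hypothetical names, the local SparkSession settings are assumptions, and the sample rows are copied from the question's table (stored as strings to match the desired output).

```python
def column_to_list(df, column_name):
    """Return the values of one column as a plain Python list.

    `df` is expected to be a PySpark DataFrame; only `select` and
    `collect` are used. The helper name is hypothetical.
    """
    rows = df.select(column_name).collect()  # list of Row objects on the driver
    return [row[column_name] for row in rows]


def demo():
    # Requires a local PySpark installation; the session settings are assumptions.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("column-to-list")
        .getOrCreate()
    )
    df = spark.createDataFrame(
        [
            ("123", "01-01-2020", "Y"),
            ("234", "01-01-2020", "Y"),
            ("512", "01-01-2020", "Y"),
            ("111", "01-01-2020", "Y"),
        ],
        ["sno_id", "updt_dt", "process_flag"],
    )
    for sno_id in column_to_list(df, "sno_id"):
        print(sno_id)  # placeholder for the per-value logic
    spark.stop()
```

Calling `demo()` needs a working PySpark installation. Note that `collect()` pulls every selected value onto the driver, so this suits columns that fit comfortably in driver memory.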

A more compact alternative is to flatMap over the underlying RDD (whether it is actually faster depends on the data and the Spark version):

sno_id_array = df.select("sno_id").rdd.flatMap(lambda x: x).collect()
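
If the goal is only to loop over the values, building the full list first is not strictly necessary. The sketch below swaps in a different technique from the ones in this answer: `DataFrame.toLocalIterator()`, which streams rows to the driver instead of materializing the whole column at once. `iter_column` and `run_logic` are hypothetical helper names, and `handler` stands in for whatever per-value logic is needed.

```python
def iter_column(df, column_name):
    """Yield the values of one column without building the full list first."""
    for row in df.select(column_name).toLocalIterator():
        yield row[column_name]


def run_logic(df, column_name, handler):
    """Apply `handler` to each value and return how many values were seen."""
    count = 0
    for value in iter_column(df, column_name):
        handler(value)
        count += 1
    return count
```

With a real DataFrame this would be called as, for example, `run_logic(df, "sno_id", print)`. It trades some speed for lower driver memory use, so for a small column the plain `collect()` approach is likely simpler.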
Strick answered Oct 22 '22 03:10