How can I convert a list of lists into a DataFrame in PySpark, where each list holds the values of one attribute?

I have a list of lists:

[[1, 2, 3], ['A', 'B', 'C'], ['aa', 'bb', 'cc']]

Each list contains the values of the attributes 'A1', 'A2', and 'A3'.

I want to get the following DataFrame:

+----------+----------+----------+ 
| A1       | A2       | A3       |
+----------+----------+----------+ 
| 1        | A        | aa       |
+----------+----------+----------+ 
| 2        | B        | bb       |
+----------+----------+----------+ 
| 3        | C        | cc       |
+----------+----------+----------+ 

How can I do it?

asked Oct 23 '17 by jartymcfly

1 Answer

You can create a Row class whose fields are the column names, then use zip(*lst) to transpose the lists and construct one Row object per row:

from pyspark.sql import Row

lst = [[1, 2, 3], ['A', 'B', 'C'], ['aa', 'bb', 'cc']]

# Define a Row class with the column names as fields
R = Row("A1", "A2", "A3")

# zip(*lst) transposes the per-column lists into per-row tuples
sc.parallelize([R(*r) for r in zip(*lst)]).toDF().show()
+---+---+---+
| A1| A2| A3|
+---+---+---+
|  1|  A| aa|
|  2|  B| bb|
|  3|  C| cc|
+---+---+---+

Alternatively, if you have pandas installed, build a pandas DataFrame first; you can then create a Spark DataFrame from it directly with spark.createDataFrame:

import pandas as pd

lst = [[1, 2, 3], ['A', 'B', 'C'], ['aa', 'bb', 'cc']]
headers = ['A1', 'A2', 'A3']

# Map each header to its list of column values
pdf = pd.DataFrame.from_dict(dict(zip(headers, lst)))
spark.createDataFrame(pdf).show()
+---+---+---+
| A1| A2| A3|
+---+---+---+
|  1|  A| aa|
|  2|  B| bb|
|  3|  C| cc|
+---+---+---+
answered Nov 15 '22 by Psidom