
List to DataFrame in pyspark

Can someone tell me how to convert a list of lists of strings to a DataFrame in PySpark? I am using Python 3.6 with Spark 2.2.1. I have just started learning Spark, and my data looks like this:

my_data =[['apple','ball','ballon'],['cat','camel','james'],['none','focus','cake']]

Now I want to create a DataFrame as follows:

+----+---------------------------+
| ID | words                     |
+----+---------------------------+
| 1  | ['apple','ball','ballon'] |
| 2  | ['cat','camel','james']   |
+----+---------------------------+

I also want to add an ID column, which is not present in the data.

user9226665 asked Oct 23 '25 15:10



1 Answer

You can convert the list to a list of Row objects, then use spark.createDataFrame, which will infer the schema from your data:

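from pyspark.sql import Row

# Row factory with the two column names
R = Row('ID', 'words')

# `spark` is the SparkSession (predefined in the pyspark shell)
# use enumerate to add the ID column (IDs start at 0)
spark.createDataFrame([R(i, x) for i, x in enumerate(my_data)]).show()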
+---+--------------------+
| ID|               words|
+---+--------------------+
|  0|[apple, ball, bal...|
|  1| [cat, camel, james]|
|  2| [none, focus, cake]|
+---+--------------------+
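
The output above numbers the rows from 0, while the table in the question starts at 1. If 1-based IDs are wanted, one possible variation is to pass start=1 to enumerate and hand plain tuples plus a list of column names to createDataFrame. This is only a sketch: build_df is a hypothetical helper name, and it assumes `spark` is an already created SparkSession.

```python
# Data from the question
my_data = [['apple', 'ball', 'ballon'],
           ['cat', 'camel', 'james'],
           ['none', 'focus', 'cake']]

# Pair each inner list with a 1-based ID
rows = [(i, words) for i, words in enumerate(my_data, start=1)]


def build_df(spark):
    """Hypothetical helper: `spark` is assumed to be an existing SparkSession."""
    # Column names are passed as the schema; the types are inferred from the data
    return spark.createDataFrame(rows, ["ID", "words"])
```

In the pyspark shell, calling build_df(spark).show() should then list the three rows with IDs 1 to 3.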
Psidom answered Oct 26 '25 09:10



