 

PySpark: convert a standard list to a data frame [duplicate]

The case is really simple: I need to convert a Python list into a data frame with the following code:

    from pyspark.sql.types import StructType
    from pyspark.sql.types import StructField
    from pyspark.sql.types import StringType, IntegerType

    schema = StructType([StructField("value", IntegerType(), True)])
    my_list = [1, 2, 3, 4]
    rdd = sc.parallelize(my_list)
    df = sqlContext.createDataFrame(rdd, schema)

    df.show()

It failed with the following error:

    raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj)))
    TypeError: StructType can not accept object 1 in type <class 'int'>
asked Jan 25 '18 by seiya


People also ask

How do I change a list to a DataFrame in PySpark?

To do this, first create a list of data and a list of column names. Then zip the data lists together and pass the zipped data, along with the column names, to the spark.createDataFrame() method, which builds the DataFrame (see the sketch below).
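A minimal sketch of that approach; the session setup, column names, and sample values here are assumptions for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # two parallel lists of data, zipped into row-like tuples
    names = ["alice", "bob", "carol"]
    ages = [30, 25, 41]
    columns = ["name", "age"]

    spark.createDataFrame(list(zip(names, ages)), columns).show()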

How do you create a PySpark DataFrame from a list of tuples?

To do this, we use the createDataFrame() method from PySpark, which creates a DataFrame from an RDD, a list, or a pandas DataFrame. Here, data is the list of tuples and columns is a list of column names (see the sketch below).
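For example (a sketch; the tuples and column names below are invented):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # each tuple becomes one row; columns supplies the header names
    data = [("alice", 30), ("bob", 25)]
    columns = ["name", "age"]

    spark.createDataFrame(data, columns).show()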


2 Answers

This approach uses less code, avoids serialization to an RDD, and is likely easier to understand:

    from pyspark.sql.types import IntegerType

    # notice the variable name (more below)
    mylist = [1, 2, 3, 4]

    # notice the parens after the type name
    spark.createDataFrame(mylist, IntegerType()).show()
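When you build a DataFrame this way from an atomic type, the single column is auto-named value, so the output should look roughly like this (exact formatting may vary by Spark version):

    +-----+
    |value|
    +-----+
    |    1|
    |    2|
    |    3|
    |    4|
    +-----+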

NOTE: About naming your variable list: the term list is a Python built-in, so it is strongly recommended to avoid using built-in names as variable names, because doing so shadows things like the list() function. When prototyping something quick and dirty, many folks use something like mylist instead.
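As a side note, if you want to keep the explicit StructType schema from the question, a common fix (a sketch, assuming the same sc and sqlContext session as the question) is to wrap each integer in a one-element tuple, since StructType expects row-like objects rather than bare ints, which is exactly what the TypeError complains about:

    from pyspark.sql.types import StructType, StructField, IntegerType

    schema = StructType([StructField("value", IntegerType(), True)])
    my_list = [1, 2, 3, 4]

    # map each bare int into a one-field tuple so it is row-like
    rdd = sc.parallelize(my_list).map(lambda x: (x,))
    sqlContext.createDataFrame(rdd, schema).show()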

answered Oct 05 '22 by E. Ducateme


Please see the code below:

    from pyspark.sql import Row

    li = [1, 2, 3, 4]
    rdd1 = sc.parallelize(li)

    # wrap each value in a Row so every element is row-like
    row_rdd = rdd1.map(lambda x: Row(x))

    # assign the DataFrame first, then show it; show() returns None,
    # so chaining the original df = ...show() would have left df empty
    df = sqlContext.createDataFrame(row_rdd, ['numbers'])
    df.show()

which prints:

    +-------+
    |numbers|
    +-------+
    |      1|
    |      2|
    |      3|
    |      4|
    +-------+
answered Oct 05 '22 by user15051990