The case is really simple: I need to convert a Python list into a DataFrame with the following code:
from pyspark.sql.types import StructType
from pyspark.sql.types import StructField
from pyspark.sql.types import StringType, IntegerType

schema = StructType([StructField("value", IntegerType(), True)])
my_list = [1, 2, 3, 4]
rdd = sc.parallelize(my_list)
df = sqlContext.createDataFrame(rdd, schema)
df.show()
It failed with the following error:
raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj)))
TypeError: StructType can not accept object 1 in type <class 'int'>
To do this, first create a list of data and a list of column names, then pass both to spark.createDataFrame(). This method builds a DataFrame from an RDD, a list, or a pandas DataFrame. The data must be a list of tuples, one tuple per row, and columns must be the list of column names. That tuple-per-row requirement is exactly why the original snippet fails: StructType cannot accept a bare int as a row.
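A minimal sketch of that approach, assuming a SparkSession is available as spark (the variable names here are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

my_list = [1, 2, 3, 4]
columns = ["value"]

# createDataFrame expects one tuple (or Row) per row, not bare ints,
# which is what the StructType error above was complaining about
data = [(x,) for x in my_list]

df = spark.createDataFrame(data, columns)
df.show()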
This solution is also an approach that uses less code, avoids serialization to an RDD, and is likely easier to understand:
from pyspark.sql.types import IntegerType

# notice the variable name (more below)
mylist = [1, 2, 3, 4]

# notice the parens after the type name
spark.createDataFrame(mylist, IntegerType()).show()
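Note that when an atomic type such as IntegerType() is passed as the schema, each list element becomes a single-column row, and Spark names that column value by default; you can rename it afterwards with withColumnRenamed() if needed.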
NOTE: About naming your variable list: the term list is a Python builtin, and as such it is strongly recommended that we avoid using builtin names as the name/label for our variables, because we end up overwriting things like the list() function. When prototyping something fast and dirty, a number of folks use something like mylist instead.
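A quick illustration of the shadowing problem (hypothetical snippet):

list = [1, 2, 3]   # shadows the builtin list type
list((4, 5))       # TypeError: 'list' object is not callable
del list           # removes the shadow and restores the builtin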
Please see the code below:
from pyspark.sql import Row

li = [1, 2, 3, 4]
rdd1 = sc.parallelize(li)
row_rdd = rdd1.map(lambda x: Row(x))
df = sqlContext.createDataFrame(row_rdd, ['numbers'])
df.show()

(The original snippet assigned the result of .show() to df, but .show() only prints and returns None; assigning the DataFrame first keeps df usable afterwards.)
+-------+
|numbers|
+-------+
|      1|
|      2|
|      3|
|      4|
+-------+
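As a side note, and assuming an active SparkSession, the same DataFrame can also be built with the toDF() shortcut on the row RDD, e.g. row_rdd.toDF(['numbers']).show().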