I want to create a sample single-column DataFrame, but the following code is not working:
df = spark.createDataFrame(["10","11","13"], ("age"))
## ValueError
## ...
## ValueError: Could not parse datatype: age
The expected result:
age
10
11
13
To create an empty PySpark DataFrame manually with a schema (column names and data types), first create the schema using StructType and StructField. Then pass an empty RDD to the createDataFrame() method of SparkSession, along with the schema for the column names and data types.
To do this, first create a list of data and a list of column names. Then pass the zipped data to the spark.createDataFrame() method, which creates the DataFrame.
To convert a Spark DataFrame column to a list, first select() the column you want, then use the Spark map() transformation to convert each Row to a String, and finally collect() the data to the driver, which returns an Array[String].
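Note that ("age") is just the string "age", not a tuple, so it gets parsed as a data type. With single-element rows you need to pass the schema as a type string: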
spark.createDataFrame(["10","11","13"], "string").toDF("age")
or as a DataType:
from pyspark.sql.types import StringType
spark.createDataFrame(["10","11","13"], StringType()).toDF("age")
With a column name, the elements should be tuples and the schema a sequence:
spark.createDataFrame([("10", ), ("11", ), ("13", )], ["age"])
Well, there is a pretty easy method for creating a sample DataFrame in PySpark:
>>> df = sc.parallelize([[1,2,3], [2,3,4]]).toDF()
>>> df.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
|  1|  2|  3|
|  2|  3|  4|
+---+---+---+
To create it with some column names:
>>> df1 = sc.parallelize([[1,2,3], [2,3,4]]).toDF(("a", "b", "c"))
>>> df1.show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
|  2|  3|  4|
+---+---+---+
This way, there is no need to define a schema either. Hope this is the simplest way.