I'm trying to take a hardcoded String and turn it into a 1-row Spark DataFrame (with a single column of type StringType) such that:
String fizz = "buzz"
would result in a DataFrame whose .show() output looks like:
+-----+
| fizz|
+-----+
| buzz|
+-----+
My best attempt thus far has been:
val rawData = List("fizz")
val df = sqlContext.sparkContext.parallelize(Seq(rawData)).toDF()
df.show()
But I get the following error at runtime:
java.lang.ClassCastException: org.apache.spark.sql.types.ArrayType cannot be cast to org.apache.spark.sql.types.StructType
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:413)
at org.apache.spark.sql.SQLImplicits.rddToDataFrameHolder(SQLImplicits.scala:155)
Any ideas as to where I'm going awry? Also, how do I set "buzz" as the row value for the fizz column?
Trying:
sqlContext.sparkContext.parallelize(rawData).toDF()
I get a DF that looks like:
+----+
|  _1|
+----+
|buzz|
+----+
You can create a PySpark DataFrame manually with the toDF() and createDataFrame() methods. Each accepts several signatures, so a DataFrame can be built from an existing RDD, a list, or another DataFrame.
To create an empty PySpark DataFrame with a schema (column names and data types), first define the schema using StructType and StructField. Then pass an empty RDD to createDataFrame() of SparkSession along with that schema.
To create a DataFrame from plain lists, first create a list of data and a list of column names, then pass the zipped data to the spark.createDataFrame() method.
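Parallelize the list directly instead of wrapping it in Seq. Seq(rawData) gives an RDD whose elements are lists, which map to ArrayType rather than StructType, hence the exception. Try: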
sqlContext.sparkContext.parallelize(rawData).toDF()
In Spark 2.0 you can call toDF directly on the local collection:
import spark.implicits._
rawData.toDF
Optionally, pass the column names to toDF:
sqlContext.sparkContext.parallelize(rawData).toDF("fizz")
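Putting the pieces together for Spark 2.x, here is a minimal sketch of a standalone program. The object name, the singleStringDf helper, the app name and the local[*] master are assumptions made for illustration; they are not part of the question or of Spark itself.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SingleStringDataFrame {

  // Hypothetical helper: builds a one-row DataFrame with a single
  // string column named `column` holding `value`.
  def singleStringDf(spark: SparkSession, column: String, value: String): DataFrame = {
    // Brings the implicit conversions that add toDF to local Seqs into scope.
    import spark.implicits._
    Seq(value).toDF(column)
  }

  def main(args: Array[String]): Unit = {
    // Assumed local session; in spark-shell a `spark` value already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("single-string-df")
      .getOrCreate()

    val df = singleStringDf(spark, "fizz", "buzz")
    df.printSchema() // one column named fizz, of type string
    df.show()

    spark.stop()
  }
}
```

Passing the column name to toDF is what replaces the default name with fizz, and the value inside the Seq ("buzz") becomes the row.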