
Creating a Spark DataFrame from a single string

I'm trying to take a hardcoded String and turn it into a 1-row Spark DataFrame (with a single column of type StringType) such that:

val fizz = "buzz"

Would result with a DataFrame whose .show() method looks like:

+-----+
| fizz|
+-----+
| buzz|
+-----+

My best attempt thus far has been:

val rawData = List("fizz")
val df = sqlContext.sparkContext.parallelize(Seq(rawData)).toDF()

df.show()

But I get the following runtime exception:

java.lang.ClassCastException: org.apache.spark.sql.types.ArrayType cannot be cast to org.apache.spark.sql.types.StructType
    at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:413)
    at org.apache.spark.sql.SQLImplicits.rddToDataFrameHolder(SQLImplicits.scala:155)

Any ideas as to where I'm going awry? Also, how do I set "buzz" as the row value for the fizz column?


Update:

Trying:

sqlContext.sparkContext.parallelize(rawData).toDF()

I get a DF that looks like:

+----+
|  _1|
+----+
|buzz|
+----+
asked Oct 10 '16 by smeeb

People also ask

How do you create a DataFrame from a string in PySpark?

You can manually create a PySpark DataFrame using the toDF() and createDataFrame() methods; both of these functions take different signatures in order to create a DataFrame from an existing RDD, list, or DataFrame.

How do you create a dummy DataFrame in PySpark?

In order to create an empty PySpark DataFrame manually with a schema (column names and data types), first create the schema using StructType and StructField. Then pass an empty RDD, together with the schema, to createDataFrame() of SparkSession.

How do I create a Spark DataFrame with column names?

To do this, first create a list of data and a list of column names. Then pass the zipped data to the spark.createDataFrame() method, which creates the DataFrame.


1 Answer

Try passing the list directly, instead of wrapping it in Seq (which produced a single row holding an array, hence the ArrayType cast error):

sqlContext.sparkContext.parallelize(rawData).toDF()

In Spark 2.0 you can convert the list directly:

import spark.implicits._

rawData.toDF

Optionally, provide a sequence of column names to toDF:

sqlContext.sparkContext.parallelize(rawData).toDF("fizz")
answered Oct 13 '22 by user6022341