I am getting an error when trying to create a DataFrame from an RDD.
My code:
from pyspark import SparkConf, SparkContext
from pyspark import sql
conf = SparkConf()
conf.setMaster('local')
conf.setAppName('Test')
sc = SparkContext(conf = conf)
print sc.version
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sql.SQLContext.createDataFrame(rdd, ["id", "score"]).collect()
print df
Error:
df = sql.SQLContext.createDataFrame(rdd, ["id", "score"]).collect()
TypeError: unbound method createDataFrame() must be called with SQLContext instance as first argument (got RDD instance instead)
I accomplished the same task in the Spark shell, where the last three lines of code run as-is and print the values. I mainly suspect the import statements, because that is the only difference between the IDE and the shell.
createDataFrame is an instance method, so you need to call it on an instance of SQLContext rather than on the SQLContext class itself. You could try something like the following:
sqlContext = sql.SQLContext(sc)
df = sqlContext.createDataFrame(rdd, ["id", "score"]).collect()
More details are in the PySpark documentation.
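To see why the class-level call fails, here is a minimal sketch in plain Python that does not use Spark at all. The `Context` class and its `create_frame` method are hypothetical stand-ins for SQLContext and createDataFrame, made up purely for illustration. When a method is called through the class, the first argument is taken as `self`, so the call raises a TypeError (the exact message differs between Python 2 and Python 3).

```python
class Context(object):
    """Hypothetical stand-in for SQLContext; not part of pyspark."""

    def __init__(self, name):
        self.name = name

    def create_frame(self, rows, columns):
        # Pair each row with the column names, like a very small DataFrame.
        return [dict(zip(columns, row)) for row in rows]


rows = [(0, 1), (1, 20)]
columns = ["id", "score"]

# Calling through the class: `rows` is treated as `self`, so the call fails
# with a TypeError instead of building anything.
try:
    Context.create_frame(rows, columns)
    class_call_error = None
except TypeError as exc:
    class_call_error = exc

# Calling through an instance: `self` is supplied automatically.
ctx = Context("demo")
frame = ctx.create_frame(rows, columns)
```

The same reasoning applies to the question's code: `sql.SQLContext.createDataFrame(rdd, ...)` passes the RDD where an SQLContext instance is expected, while `sqlContext.createDataFrame(rdd, ...)` does not.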