I am using Zeppelin 0.5.5. I found this code/sample here for python as I couldn't get my own to work with %pyspark http://www.makedatauseful.com/python-spark-sql-zeppelin-tutorial/. I have a feeling his %pyspark example worked because if you using the original %spark zeppelin tutorial the "bank" table is already created.
This code is in a notebook.
%pyspark
from os import getcwd
# sqlContext = SQLContext(sc) # Removed with latest version I tested
zeppelinHome = getcwd()
bankText = sc.textFile(zeppelinHome+"/data/bank-full.csv")
bankSchema = StructType([StructField("age", IntegerType(), False),StructField("job", StringType(), False),StructField("marital", StringType(), False),StructField("education", StringType(), False),StructField("balance", IntegerType(), False)])
bank = bankText.map(lambda s: s.split(";")).filter(lambda s: s[0] != "\"age\"").map(lambda s:(int(s[0]), str(s[1]).replace("\"", ""), str(s[2]).replace("\"", ""), str(s[3]).replace("\"", ""), int(s[5]) ))
bankdf = sqlContext.createDataFrame(bank,bankSchema)
bankdf.registerAsTable("bank")
This code is in the same notebook but different work pad.
%sql
SELECT count(1) FROM bank
org.apache.spark.sql.AnalysisException: no such table bank; line 1 pos 21
...
You can restart the interpreter for the notebook in the interpreter bindings (gear in upper right hand corner) by clicking on the restart icon to the left of the interpreter in question (in this case it would be the spark interpreter).
Interpreters in zeppelin Currently Zeppelin supports many interpreters such as Scala(with Apache Spark), Python(with Apache Spark), SparkSQL, Hive, Markdown and Shell.
I found the problem to this issue. Prior to 0.6.0 the sqlContext variable is sqlc in %pyspark.
Defect can be found here: https://issues.apache.org/jira/browse/ZEPPELIN-134
In Pyspark, the SQLContext is currently available in the variable name sqlc. This is incosistent with the documentation and with the variable name in scala which is sqlContext.
sqlContext can be used as a variable for the SQLContext, in addition to sqlc (for backward compatibility)
Related code: https://github.com/apache/incubator-zeppelin/blob/master/spark/src/main/resources/python/zeppelin_pyspark.py#L66
The suggested workaround is simply to do the following in your %pyspark script
sqlContext = sqlc
Found here:
https://mail-archives.apache.org/mod_mbox/incubator-zeppelin-users/201506.mbox/%3CCALf24sazkTxVd3EpLKTWo7yfE4NvW032j346N+6AuB7KKZS_AQ@mail.gmail.com%3E
Instead of sqlContext, use sqlc and replace registerAsTable by sqlc.registerDataFrameAsTable
%pyspark
from os import getcwd
zeppelinHome = getcwd()
bankText = sc.textFile(zeppelinHome+"/data/bank-full.csv")
bankSchema = StructType([StructField("age", IntegerType(), False),StructField("job", StringType(), False),StructField("marital", StringType(), False),StructField("education", StringType(), False),StructField("balance", IntegerType(), False)])
bank = bankText.map(lambda s: s.split(";")).filter(lambda s: s[0] != "\"age\"").map(lambda s:(int(s[0]), str(s[1]).replace("\"", ""), str(s[2]).replace("\"", ""), str(s[3]).replace("\"", ""), int(s[5]) ))
bankdf = sqlc.createDataFrame(bank,bankSchema)
sqlc.registerDataFrameAsTable(bankdf, "bank")
%sql
SELECT count(1) FROM bank
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With