When registering a table using the %pyspark interpreter in Zeppelin, I can't access the table in %sql

I am using Zeppelin 0.5.5. I found this code/sample here for python as I couldn't get my own to work with %pyspark http://www.makedatauseful.com/python-spark-sql-zeppelin-tutorial/. I have a feeling his %pyspark example worked because if you using the original %spark zeppelin tutorial the "bank" table is already created.

This code is in a notebook.

%pyspark
from os import getcwd
# sqlContext = SQLContext(sc) # Removed with latest version I tested
zeppelinHome = getcwd()
bankText = sc.textFile(zeppelinHome+"/data/bank-full.csv")

bankSchema = StructType([StructField("age", IntegerType(),     False),StructField("job", StringType(), False),StructField("marital", StringType(), False),StructField("education", StringType(), False),StructField("balance", IntegerType(), False)])

bank = bankText.map(lambda s: s.split(";")).filter(lambda s: s[0] != "\"age\"").map(lambda s:(int(s[0]), str(s[1]).replace("\"", ""), str(s[2]).replace("\"", ""), str(s[3]).replace("\"", ""), int(s[5]) ))

bankdf = sqlContext.createDataFrame(bank,bankSchema)
bankdf.registerAsTable("bank")

This code is in the same notebook but different work pad.

%sql 
SELECT count(1) FROM bank

org.apache.spark.sql.AnalysisException: no such table bank; line 1 pos 21
...

How do you reset the spark interpreter on Zeppelin?

You can restart the interpreter for the notebook in the interpreter bindings (gear in upper right hand corner) by clicking on the restart icon to the left of the interpreter in question (in this case it would be the spark interpreter).

Which of the following is an interpreter built into Apache Zeppelin?

Interpreters in zeppelin Currently Zeppelin supports many interpreters such as Scala(with Apache Spark), Python(with Apache Spark), SparkSQL, Hive, Markdown and Shell.

I found the problem to this issue. Prior to 0.6.0 the sqlContext variable is sqlc in %pyspark.

Defect can be found here: https://issues.apache.org/jira/browse/ZEPPELIN-134

In Pyspark, the SQLContext is currently available in the variable name sqlc. This is incosistent with the documentation and with the variable name in scala which is sqlContext.

sqlContext can be used as a variable for the SQLContext, in addition to sqlc (for backward compatibility)

Related code: https://github.com/apache/incubator-zeppelin/blob/master/spark/src/main/resources/python/zeppelin_pyspark.py#L66

The suggested workaround is simply to do the following in your %pyspark script

sqlContext = sqlc

Found here:

https://mail-archives.apache.org/mod_mbox/incubator-zeppelin-users/201506.mbox/%3CCALf24sazkTxVd3EpLKTWo7yfE4NvW032j346N+6AuB7KKZS_AQ@mail.gmail.com%3E

Instead of sqlContext, use sqlc and replace registerAsTable by sqlc.registerDataFrameAsTable

%pyspark
from os import getcwd
zeppelinHome = getcwd()
bankText = sc.textFile(zeppelinHome+"/data/bank-full.csv")

bankSchema = StructType([StructField("age", IntegerType(), False),StructField("job", StringType(), False),StructField("marital", StringType(), False),StructField("education", StringType(), False),StructField("balance", IntegerType(), False)])

bank = bankText.map(lambda s: s.split(";")).filter(lambda s: s[0] != "\"age\"").map(lambda s:(int(s[0]), str(s[1]).replace("\"", ""), str(s[2]).replace("\"", ""), str(s[3]).replace("\"", ""), int(s[5]) ))

bankdf = sqlc.createDataFrame(bank,bankSchema)
sqlc.registerDataFrameAsTable(bankdf, "bank")


%sql 
SELECT count(1) FROM bank

When registering a table using the %pyspark interpreter in Zeppelin, I can't access the table in %sql

Tags:

apache-spark-sql

apache-zeppelin

Kevin Vasko

People also ask

2 Answers

Kevin Vasko

user4720308

Recent Activity

Donate For Us

When registering a table using the %pyspark interpreter in Zeppelin, I can't access the table in %sql

Tags:

apache-spark-sql

apache-zeppelin

Kevin Vasko

People also ask

2 Answers

Kevin Vasko

user4720308

Related questions

Recent Activity

Donate For Us