 

When registering a table using the %pyspark interpreter in Zeppelin, I can't access the table in %sql

I am using Zeppelin 0.5.5. I couldn't get my own code to work with %pyspark, so I used this Python sample: http://www.makedatauseful.com/python-spark-sql-zeppelin-tutorial/. I suspect his %pyspark example only worked because if you run the original %spark Zeppelin tutorial first, the "bank" table is already created.

This code is in a notebook.

%pyspark
from os import getcwd
# StructType and friends are not auto-imported in every Zeppelin build
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# sqlContext = SQLContext(sc)  # Removed with latest version I tested
zeppelinHome = getcwd()
bankText = sc.textFile(zeppelinHome + "/data/bank-full.csv")

bankSchema = StructType([
    StructField("age", IntegerType(), False),
    StructField("job", StringType(), False),
    StructField("marital", StringType(), False),
    StructField("education", StringType(), False),
    StructField("balance", IntegerType(), False)])

# Split on ";", drop the header row, strip quotes, and cast the numeric columns
bank = bankText.map(lambda s: s.split(";")) \
    .filter(lambda s: s[0] != "\"age\"") \
    .map(lambda s: (int(s[0]), str(s[1]).replace("\"", ""), str(s[2]).replace("\"", ""),
                    str(s[3]).replace("\"", ""), int(s[5])))

bankdf = sqlContext.createDataFrame(bank, bankSchema)
bankdf.registerAsTable("bank")

This code is in the same notebook, but in a different paragraph.

%sql 
SELECT count(1) FROM bank

org.apache.spark.sql.AnalysisException: no such table bank; line 1 pos 21
...
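
A quick way to see what the notebook's SQLContext actually knows about (a sketch, assuming Spark 1.3+ where SQLContext.tableNames() exists) is to list the registered temp tables from %pyspark; if "bank" is missing here, or was registered on a different context than the one %sql queries, the AnalysisException above is expected:

%pyspark
# List temp tables registered on this context; "bank" should appear
# here if registration reached the context that %sql queries (sketch)
print(sqlContext.tableNames())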
asked Nov 23 '15 by Kevin Vasko

People also ask

How do you reset the spark interpreter on Zeppelin?

You can restart the interpreter for the notebook from the interpreter bindings (the gear icon in the upper-right corner) by clicking the restart icon to the left of the interpreter in question (in this case, the spark interpreter).

Which of the following is an interpreter built into Apache Zeppelin?

Currently Zeppelin supports many interpreters, such as Scala (with Apache Spark), Python (with Apache Spark), SparkSQL, Hive, Markdown, and Shell.


2 Answers

I found the problem. Prior to Zeppelin 0.6.0, the SQLContext variable in %pyspark is named sqlc, not sqlContext.

The defect is tracked here: https://issues.apache.org/jira/browse/ZEPPELIN-134

In Pyspark, the SQLContext is currently available in the variable name sqlc. This is inconsistent with the documentation and with the variable name in Scala, which is sqlContext.

sqlContext can be used as a variable for the SQLContext, in addition to sqlc (for backward compatibility)

Related code: https://github.com/apache/incubator-zeppelin/blob/master/spark/src/main/resources/python/zeppelin_pyspark.py#L66

The suggested workaround is simply to add the following to your %pyspark script:

sqlContext = sqlc
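
Applied to the code from the question, the alias goes at the top of the %pyspark paragraph, and everything after it then registers against Zeppelin's shared context (a sketch of just the relevant lines, not the full script):

%pyspark
# Alias Zeppelin's injected context (sqlc) to the documented name
sqlContext = sqlc

bankdf = sqlContext.createDataFrame(bank, bankSchema)
bankdf.registerAsTable("bank")  # now lands on the shared context, so %sql can see it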

Found here:

https://mail-archives.apache.org/mod_mbox/incubator-zeppelin-users/201506.mbox/%3CCALf24sazkTxVd3EpLKTWo7yfE4NvW032j346N+6AuB7KKZS_AQ@mail.gmail.com%3E

answered Sep 30 '22 by Kevin Vasko


Instead of sqlContext, use sqlc, and replace registerAsTable with sqlc.registerDataFrameAsTable:

%pyspark
from os import getcwd
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

zeppelinHome = getcwd()
bankText = sc.textFile(zeppelinHome + "/data/bank-full.csv")

bankSchema = StructType([
    StructField("age", IntegerType(), False),
    StructField("job", StringType(), False),
    StructField("marital", StringType(), False),
    StructField("education", StringType(), False),
    StructField("balance", IntegerType(), False)])

bank = bankText.map(lambda s: s.split(";")) \
    .filter(lambda s: s[0] != "\"age\"") \
    .map(lambda s: (int(s[0]), str(s[1]).replace("\"", ""), str(s[2]).replace("\"", ""),
                    str(s[3]).replace("\"", ""), int(s[5])))

# Use Zeppelin's shared SQLContext (sqlc) so the table is visible to %sql
bankdf = sqlc.createDataFrame(bank, bankSchema)
sqlc.registerDataFrameAsTable(bankdf, "bank")


%sql 
SELECT count(1) FROM bank
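
The same query can also be issued from %pyspark through sqlc, which is a handy way to confirm that both paragraphs are talking to the same context (a sketch, assuming the Spark 1.3+ DataFrame API):

%pyspark
# Same query as the %sql paragraph, run directly against sqlc
sqlc.sql("SELECT count(1) FROM bank").show()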
answered Sep 30 '22 by user4720308