I am using CDH 5.5.
I have a table created in the Hive default database and I am able to query it from the Hive CLI.
Output
hive> use default;
OK
Time taken: 0.582 seconds
hive> show tables;
OK
bank
Time taken: 0.341 seconds, Fetched: 1 row(s)
hive> select count(*) from bank;
OK
542
Time taken: 64.961 seconds, Fetched: 1 row(s)
However, I am unable to query the table from PySpark, as it does not recognize the table.
from pyspark.context import SparkContext
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
sqlContext.sql("use default")
DataFrame[result: string]
sqlContext.sql("show tables").show()
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
+---------+-----------+

sqlContext.sql("FROM bank SELECT count(*)")
16/03/16 20:12:13 INFO parse.ParseDriver: Parsing command: FROM bank SELECT count(*)
16/03/16 20:12:13 INFO parse.ParseDriver: Parse Completed
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/context.py", line 552, in sql
    return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 40, in deco
    raise AnalysisException(s.split(': ', 1)[1])
pyspark.sql.utils.AnalysisException: no such table bank; line 1 pos 5
New Error
>>> from pyspark.sql import HiveContext
>>> hive_context = HiveContext(sc)
>>> bank = hive_context.table("default.bank")
16/03/22 18:33:30 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
16/03/22 18:33:30 INFO DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
16/03/22 18:33:44 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:44 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:48 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:48 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:50 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/context.py", line 565, in table
    return DataFrame(self._ssql_ctx.table(tableName), self)
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 36, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o22.table.
: org.apache.spark.sql.catalyst.analysis.NoSuchTableException
    at org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:123)
    at org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:123)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.sql.hive.client.ClientInterface$class.getTable(ClientInterface.scala:123)
    at org.apache.spark.sql.hive.client.ClientWrapper.getTable(ClientWrapper.scala:60)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:406)
    at org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:422)
    at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:203)
    at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:203)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:203)
    at org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:422)
    at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739)
    at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)
Thanks.
Spark SQL also supports reading and writing data stored in Apache Hive. However, since Hive has a large number of dependencies, these dependencies are not included in the default Spark distribution. If Hive dependencies can be found on the classpath, Spark will load them automatically.
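As a quick sanity check, you can first confirm that Spark is actually talking to the Hive metastore; a minimal sketch, assuming `sc` is the SparkContext provided by the pyspark shell:

from pyspark.sql import HiveContext

hive_context = HiveContext(sc)

# If Spark can see the Hive metastore, these should list the existing
# databases and the tables in `default` (including `bank`) instead of
# coming back empty.
hive_context.sql("show databases").show()
hive_context.tables("default").show()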
We cannot pass the Hive table name directly to the HiveContext's sql method, since it does not resolve the table name on its own. One way to read a Hive table in the pyspark shell is:
from pyspark.sql import HiveContext

hive_context = HiveContext(sc)
bank = hive_context.table("default.bank")
bank.show()
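Once `bank` is a DataFrame, the count from the original Hive query can also be reproduced with the DataFrame API alone; a minimal usage sketch, assuming the table lookup above succeeded:

# Print the schema Spark picked up from the Hive metastore.
bank.printSchema()

# Row count through the DataFrame API; for this table it should return 542.
print(bank.count())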
To run SQL on the Hive table, we first need to register the DataFrame obtained from reading the Hive table as a temporary table. Then we can run the SQL query against it.
bank.registerTempTable("bank_temp")
hive_context.sql("select * from bank_temp").show()
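For example, the count(*) query from the Hive CLI can now be run against the temporary table; a minimal sketch, assuming the registration above succeeded (the `cnt` alias is only for illustration):

# Same aggregate as the original Hive query, but through the temp table.
hive_context.sql("select count(*) as cnt from bank_temp").show()

# Or pull the value back into Python; expected output is 542 for this table.
row = hive_context.sql("select count(*) as cnt from bank_temp").collect()[0]
print(row.cnt)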