 

Query HIVE table in pyspark

Tags: pyspark, hive

I am using CDH 5.5.

I have a table created in the Hive default database, and I am able to query it from the Hive CLI.

Output

hive> use default;
OK
Time taken: 0.582 seconds

hive> show tables;
OK
bank
Time taken: 0.341 seconds, Fetched: 1 row(s)

hive> select count(*) from bank;
OK
542
Time taken: 64.961 seconds, Fetched: 1 row(s)

However, I am unable to query the table from pyspark; it does not recognize the table.

from pyspark.context import SparkContext
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)

sqlContext.sql("use default")
DataFrame[result: string]

sqlContext.sql("show tables").show()
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
+---------+-----------+

sqlContext.sql("FROM bank SELECT count(*)")
16/03/16 20:12:13 INFO parse.ParseDriver: Parsing command: FROM bank SELECT count(*)
16/03/16 20:12:13 INFO parse.ParseDriver: Parse Completed
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/context.py", line 552, in sql
    return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 40, in deco
    raise AnalysisException(s.split(': ', 1)[1])
pyspark.sql.utils.AnalysisException: no such table bank; line 1 pos 5

New Error

>>> from pyspark.sql import HiveContext
>>> hive_context = HiveContext(sc)
>>> bank = hive_context.table("default.bank")
16/03/22 18:33:30 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
16/03/22 18:33:30 INFO DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
16/03/22 18:33:44 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:44 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:48 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:48 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:50 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/context.py", line 565, in table
    return DataFrame(self._ssql_ctx.table(tableName), self)
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 36, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o22.table.
: org.apache.spark.sql.catalyst.analysis.NoSuchTableException
    at org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:123)
    at org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:123)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.sql.hive.client.ClientInterface$class.getTable(ClientInterface.scala:123)
    at org.apache.spark.sql.hive.client.ClientWrapper.getTable(ClientWrapper.scala:60)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:406)
    at org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:422)
    at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:203)
    at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:203)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:203)
    at org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:422)
    at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739)
    at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)

Thanks.

Asked Mar 17 '16 by Chn


People also ask

Does PySpark use Hive?

Spark SQL also supports reading and writing data stored in Apache Hive. However, since Hive has a large number of dependencies, these dependencies are not included in the default Spark distribution. If Hive dependencies can be found on the classpath, Spark will load them automatically.
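To make that concrete, here is a minimal sketch (not from the question, which uses Spark 1.x on CDH 5.5): on Spark 2.x and later, Hive integration is requested when the session is built, and Spark locates the metastore through a hive-site.xml on its classpath (typically in $SPARK_HOME/conf). The app name is arbitrary.

# Sketch: enabling Hive support on Spark 2.x+. Assumes hive-site.xml is on
# Spark's classpath (e.g. $SPARK_HOME/conf) so the real metastore is used;
# without it, Spark falls back to a local metastore and existing Hive tables
# will not be visible.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-example")      # arbitrary application name
         .enableHiveSupport()          # loads Hive classes if on the classpath
         .getOrCreate())

spark.sql("show tables in default").show()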


1 Answer

We cannot pass a bare Hive table name to the HiveContext sql() method, because sql() expects a full SQL statement, not a table name. One way to read a Hive table in the pyspark shell is:

from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
bank = hive_context.table("default.bank")
bank.show()
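As a follow-up illustration (not part of the original answer): hive_context.table() returns an ordinary DataFrame, so schema inspection and the usual transformations work on it directly. The column name "age" below is hypothetical; substitute a real column from the bank table.

bank.printSchema()                  # inspect the Hive table's columns
print(bank.count())                 # should match the Hive CLI count (542 in the question)
bank.filter(bank.age > 30).show(5)  # "age" is a made-up column for illustration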

To run SQL against the Hive table, first register the DataFrame obtained from the table as a temporary table, then run the SQL query against it:

bank.registerTempTable("bank_temp")
hive_context.sql("select * from bank_temp").show()
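A small added note for readers on Spark 2.x or later, where registerTempTable() is deprecated: the equivalent sketch below uses createOrReplaceTempView() on a SparkSession built with enableHiveSupport() (as in the earlier sketch). The session variable name "spark" is an assumption.

bank = spark.table("default.bank")          # read the Hive table via the session
bank.createOrReplaceTempView("bank_temp")   # replacement for registerTempTable()
spark.sql("SELECT count(*) FROM bank_temp").show()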
Answered Sep 26 '22 by bijay697