I am using CDH 5.5.
I have a table created in the Hive default database and I am able to query it from the Hive CLI.
Output
hive> use default;
OK
Time taken: 0.582 seconds
hive> show tables;
OK
bank
Time taken: 0.341 seconds, Fetched: 1 row(s)
hive> select count(*) from bank;
OK
542
Time taken: 64.961 seconds, Fetched: 1 row(s)
However, I am unable to query the table from PySpark, as it does not recognize the table.
from pyspark.context import SparkContext
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
sqlContext.sql("use default")
DataFrame[result: string]
sqlContext.sql("show tables").show()
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
+---------+-----------+

sqlContext.sql("FROM bank SELECT count(*)")
16/03/16 20:12:13 INFO parse.ParseDriver: Parsing command: FROM bank SELECT count(*)
16/03/16 20:12:13 INFO parse.ParseDriver: Parse Completed
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/context.py", line 552, in sql
    return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 40, in deco
    raise AnalysisException(s.split(': ', 1)[1])
pyspark.sql.utils.AnalysisException: no such table bank; line 1 pos 5
New Error
>>> from pyspark.sql import HiveContext
>>> hive_context = HiveContext(sc)
>>> bank = hive_context.table("default.bank")
16/03/22 18:33:30 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
16/03/22 18:33:30 INFO DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
16/03/22 18:33:44 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:44 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:48 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:48 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:50 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/context.py", line 565, in table
    return DataFrame(self._ssql_ctx.table(tableName), self)
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 36, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o22.table.
: org.apache.spark.sql.catalyst.analysis.NoSuchTableException
    at org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:123)
    at org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:123)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.sql.hive.client.ClientInterface$class.getTable(ClientInterface.scala:123)
    at org.apache.spark.sql.hive.client.ClientWrapper.getTable(ClientWrapper.scala:60)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:406)
    at org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:422)
    at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:203)
    at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:203)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:203)
    at org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:422)
    at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739)
    at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)
Thanks.
Spark SQL also supports reading and writing data stored in Apache Hive. However, since Hive has a large number of dependencies, these dependencies are not included in the default Spark distribution. If Hive dependencies can be found on the classpath, Spark will load them automatically.
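As a quick sanity check, you can first confirm that Spark is actually talking to the Hive metastore; a minimal sketch, assuming `sc` is the SparkContext provided by the pyspark shell:

from pyspark.sql import HiveContext

hive_context = HiveContext(sc)

# If Spark can see the Hive metastore, these should list the existing
# databases and the tables in `default` (including `bank`) instead of
# coming back empty.
hive_context.sql("show databases").show()
hive_context.tables("default").show()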
We cannot pass the Hive table name directly to the HiveContext's sql method, since it does not resolve the table name on its own. One way to read a Hive table in the pyspark shell is:
from pyspark.sql import HiveContext

hive_context = HiveContext(sc)
bank = hive_context.table("default.bank")
bank.show()
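Once `bank` is a DataFrame, the count from the original Hive query can also be reproduced with the DataFrame API alone; a minimal usage sketch, assuming the table lookup above succeeded:

# Print the schema Spark picked up from the Hive metastore.
bank.printSchema()

# Row count through the DataFrame API; for this table it should return 542.
print(bank.count())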
To run SQL on the Hive table, we first need to register the DataFrame obtained from reading the Hive table as a temporary table. Then we can run the SQL query against it.
bank.registerTempTable("bank_temp")
hive_context.sql("select * from bank_temp").show()
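For example, the count(*) query from the Hive CLI can now be run against the temporary table; a minimal sketch, assuming the registration above succeeded (the `cnt` alias is only for illustration):

# Same aggregate as the original Hive query, but through the temp table.
hive_context.sql("select count(*) as cnt from bank_temp").show()

# Or pull the value back into Python; expected output is 542 for this table.
row = hive_context.sql("select count(*) as cnt from bank_temp").collect()[0]
print(row.cnt)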