How to connect spark with hive using pyspark?

I am trying to read Hive tables remotely using PySpark, but it fails with an error saying it is unable to connect to the Hive Metastore client.

I have read multiple answers on SO and other sources. They mostly covered configuration, and none of them explained why I am unable to connect remotely. From the documentation, I understood that Spark can connect to Hive without changes to any configuration file. Note: I have port-forwarded the machine where Hive is running and made it available at localhost:10000. I also connected to it using Presto and was able to run queries on Hive.

The code is:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, HiveContext
SparkContext.setSystemProperty("hive.metastore.uris", "thrift://localhost:9083")
sparkSession = (SparkSession
                .builder
                .appName('example-pyspark-read-and-write-from-hive')
                .enableHiveSupport()
                .getOrCreate())
data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = sparkSession.createDataFrame(data)
df.write.saveAsTable('example')

I expect the output to be an acknowledgment that the table was saved, but instead I get this error.

The abridged error is:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/local/spark/python/pyspark/sql/readwriter.py", line 775, in saveAsTable
    self._jwrite.saveAsTable(name)
  File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/local/spark/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'

I set up the port forwarding with this command:

ssh -i ~/.ssh/id_rsa_sc -L 9000:A.B.C.D:8080 -L 9083:E.F.G.H:9083 -L 10000:E.F.G.H:10000 [email protected]

Checking ports 10000 and 9083 shows that both are reachable:

aviral@versinator:~/testing-spark-hive$ nc -zv localhost 10000
Connection to localhost 10000 port [tcp/webmin] succeeded!
aviral@versinator:~/testing-spark-hive$ nc -zv localhost 9083
Connection to localhost 9083 port [tcp/*] succeeded!
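
The same reachability check can be sketched in Python with the standard socket module, which is handy when nc is not installed. This is only a minimal sketch: `port_is_open` is a hypothetical helper name, and it assumes the ssh tunnel above is already running. Like nc, it only shows that the forwarded port accepts TCP connections, not that the service behind it answers correctly.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError (refused, timed out, ...) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 9083 is the metastore port used in hive.metastore.uris;
    # 10000 is the other forwarded Hive port from the ssh command.
    for port in (9083, 10000):
        state = "open" if port_is_open("localhost", port) else "closed"
        print(f"localhost:{port} is {state}")
```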

Upon running the script, I get the following error:

Caused by: java.net.UnknownHostException: ip-172-16-1-101.ap-south-1.compute.internal
    ... 45 more
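
A java.net.UnknownHostException is raised when the IP address of a host name cannot be determined, so one thing worth checking is whether that internal name resolves on the local machine at all. A minimal sketch using only the standard library follows; `resolves` is a hypothetical helper name, and the check says nothing about whether the host is reachable once it resolves.

```python
import socket


def resolves(hostname: str) -> bool:
    """Return True if the local resolver can map hostname to an address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


if __name__ == "__main__":
    # Host name copied from the UnknownHostException above
    host = "ip-172-16-1-101.ap-south-1.compute.internal"
    print(f"{host} resolves locally: {resolves(host)}")
```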

asked Mar 25 '19 13:03

Aviral Srivastava

1 Answer

The catch is to set the Hive configuration while creating the Spark session itself.

sparkSession = (SparkSession
                .builder
                .appName('example-pyspark-read-and-write-from-hive')
                # conf= is not needed here; the key and value alone set the option
                .config("hive.metastore.uris", "thrift://localhost:9083")
                .enableHiveSupport()
                .getOrCreate()
                )

Note that no changes to the Spark configuration files are required. Even serverless services like AWS Glue can make such connections.

For full code:

from pyspark.sql import SparkSession

# Equivalent Java builder, for reference:
# SparkSession ss = SparkSession
#     .builder()
#     .appName(" Hive example")
#     .config("hive.metastore.uris", "thrift://localhost:9083")
#     .enableHiveSupport()
#     .getOrCreate();

# Point the session at the port-forwarded metastore when building it
sparkSession = (SparkSession
                .builder
                .appName('example-pyspark-read-and-write-from-hive')
                .config("hive.metastore.uris", "thrift://localhost:9083")
                .enableHiveSupport()
                .getOrCreate()
                )
data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = sparkSession.createDataFrame(data)
# Write into Hive
#df.write.saveAsTable('example')

# Read the table back from Hive; show() prints the rows itself and returns None
df_load = sparkSession.sql('SELECT * FROM example')
df_load.show()

answered Oct 02 '22 12:10

Aviral Srivastava

Tags: python-3.x, pyspark, pyspark-sql, hive, thrift-protocol
