Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to Access Hive via Python?

https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-Python appears to be outdated.

When I add this to /etc/profile:

export PYTHONPATH=$PYTHONPATH:/usr/lib/hive/lib/py 

I can then do the imports as listed in the link, with the exception of from hive import ThriftHive which actually need to be:

from hive_service import ThriftHive 

Next the port in the example was 10000, which when I tried caused the program to hang. The default Hive Thrift port is 9083, which stopped the hanging.

So I set it up like so:

from thrift import Thrift from thrift.transport import TSocket from thrift.transport import TTransport from thrift.protocol import TBinaryProtocol try:     transport = TSocket.TSocket('<node-with-metastore>', 9083)     transport = TTransport.TBufferedTransport(transport)     protocol = TBinaryProtocol.TBinaryProtocol(transport)     client = ThriftHive.Client(protocol)     transport.open()     client.execute("CREATE TABLE test(c1 int)")      transport.close() except Thrift.TException, tx:     print '%s' % (tx.message) 

I received the following error:

Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/lib/hive/lib/py/hive_service/ThriftHive.py", line 68, in execute self.recv_execute() File "/usr/lib/hive/lib/py/hive_service/ThriftHive.py", line 84, in recv_execute raise x thrift.Thrift.TApplicationException: Invalid method name: 'execute' 

But inspecting the ThriftHive.py file reveals the method execute within the Client class.

How may I use Python to access Hive?

like image 301
Matthew Moisen Avatar asked Jan 26 '14 23:01

Matthew Moisen


1 Answers

I believe the easiest way is to use PyHive.

To install you'll need these libraries:

pip install sasl pip install thrift pip install thrift-sasl pip install PyHive 

Please note that although you install the library as PyHive, you import the module as pyhive, all lower-case.

If you're on Linux, you may need to install SASL separately before running the above. Install the package libsasl2-dev using apt-get or yum or whatever package manager for your distribution. For Windows there are some options on GNU.org, you can download a binary installer. On a Mac SASL should be available if you've installed xcode developer tools (xcode-select --install in Terminal)

After installation, you can connect to Hive like this:

from pyhive import hive conn = hive.Connection(host="YOUR_HIVE_HOST", port=PORT, username="YOU") 

Now that you have the hive connection, you have options how to use it. You can just straight-up query:

cursor = conn.cursor() cursor.execute("SELECT cool_stuff FROM hive_table") for result in cursor.fetchall():   use_result(result) 

...or to use the connection to make a Pandas dataframe:

import pandas as pd df = pd.read_sql("SELECT cool_stuff FROM hive_table", conn) 
like image 147
Tristan Reid Avatar answered Oct 14 '22 18:10

Tristan Reid