
Hive Data to Pandas Data frame

I'm new to Python.

How can I load data from Hive into a pandas DataFrame?

import pandas as pd
import pyhs2

with pyhs2.connect(host=host, port=20000, authMechanism="PLAIN",
                   user=user, password=password,
                   database=database) as conn:
    with conn.cursor() as cur:
        # Show databases
        print cur.getDatabases()

        # Execute query
        cur.execute(query)

        # Return column info from query
        print cur.getSchema()

        # Fetch table results
        for i in cur.fetch():
            print i

        # Attempt to build a DataFrame using the column names
        # (this is the step that does not work)
        columnNames = [a['columnName'] for a in cur.getSchema()]
        print columnNames
        df1 = pd.DataFrame(cur.fetch(), columnNames)

I tried passing the column names, but it didn't work.

Please suggest a way to do this.

ankita gupta asked Jul 06 '16 07:07



2 Answers

pd.read_sql() (pandas 0.24.0) takes a DB connection, so a PyHive connection can be passed to it directly:

from pyhive import hive
import pandas as pd

# open connection
conn = hive.Connection(host=host, port=20000, ...)

# query the table to a new dataframe
dataframe = pd.read_sql("SELECT id, name FROM test.example_table", conn)

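The DataFrame's columns will be named after the Hive table's columns. They can be changed during or after DataFrame creation if needed:

  • via HiveQL: SELECT id AS new_column_name ...
  • via the DataFrame's columns attribute or rename() once pd.read_sql() has returned (the columns argument of pd.read_sql() only selects columns when reading a whole table; it does not rename them)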
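Both renaming routes can be tried without a Hive cluster. The sketch below is only an illustration: it uses an in-memory sqlite3 database as a stand-in for the Hive connection (an assumption made so the example runs anywhere), and the table contents and the name user_id are made up.

```python
import sqlite3

import pandas as pd

# Stand-in for hive.Connection(...): an in-memory SQLite database
# (assumption for illustration only, not a Hive connection).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example_table (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO example_table VALUES (?, ?)", [(1, "a"), (2, "b")])

# Route 1: rename inside the query with AS.
df_sql = pd.read_sql("SELECT id AS user_id, name FROM example_table", conn)

# Route 2: rename after the DataFrame has been created.
df_py = pd.read_sql("SELECT id, name FROM example_table", conn)
df_py = df_py.rename(columns={"id": "user_id"})

conn.close()
```

Renaming in the query keeps the change next to the SQL, while rename() is handy when the same query result is reused under different names.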
Saftography answered Sep 16 '22 21:09



You can try this (I'm pretty sure it will work):

res = cur.getSchema()
description = [col['columnName'] for col in res]  # column names as reported by Hive

# keep only the part after the last period, so "table.column" becomes "column"
# ([-1] also handles names without a period, where [1] would raise IndexError)
headers = [x.split(".")[-1] for x in description]

df = pd.DataFrame(cur.fetchall(), columns=headers)

df.head(n=20)
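Column names may come back qualified with the table name (for example example_table.id), which is what the split on the period deals with. A minimal sketch of that step as a reusable helper follows; strip_table_prefix is a hypothetical name, and the schema and rows are made-up stand-ins for cur.getSchema() and cur.fetchall().

```python
def strip_table_prefix(column_names):
    """Drop a leading 'table.' qualifier from each name, if one is present."""
    return [name.rsplit(".", 1)[-1] for name in column_names]


# Made-up data standing in for cur.getSchema() and cur.fetchall().
schema = [{"columnName": "example_table.id"}, {"columnName": "name"}]
rows = [(1, "a"), (2, "b")]

# Clean the names, then pair each row with them.
headers = strip_table_prefix(col["columnName"] for col in schema)
records = [dict(zip(headers, row)) for row in rows]
```

With pandas available, pd.DataFrame(rows, columns=headers) builds the DataFrame from the same two pieces.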
ML_Passion answered Sep 19 '22 21:09
