Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Write pandas table to impala

Using the impyla module, I've downloaded the results of an impala query into a pandas dataframe, done analysis, and would now like to write the results back to a table on impala, or at least to an hdfs file.

However, I cannot find any information on how to do this, or even how to ssh into the impala shell and write the table from there.

What I'd like to do:

from impala.dbapi import connect
from impala.util import as_pandas

# connect to my host and port
conn=connect(host='myhost', port=111) 

# create query to save table as pandas df
create_query = """
     SELECT * FROM {}
     """.format(my_table_name)

# run query on impala
cur = conn.cursor()
cur.execute(create_query)

# store results as pandas data frame
pandas_df = as_pandas(cur)
cur.close()  

Once I've done whatever I need to do with pandas_df, save those results back to impala as a table.

# create query to save new_df back to impala
save_query = """
    CREATE TABLE new_table AS
        SELECT * 
        FROM pandas_df
    """

# run query on impala
cur = conn.cursor()
cur.execute(save_query)
cur.close()

The above scenario would be ideal, but I'd be happy if I could figure out how to ssh into impala-shell and do this from python, or even just save the table to hdfs. I'm writing this as a script for other users, so it's essential to have this all done within the script. Thanks so much!

like image 760
SummerEla Avatar asked Sep 01 '15 17:09

SummerEla


2 Answers

You're going to love Ibis! It has the HDFS functions (put, namely) and wraps the Impala DML and DDL you'll need to make this easy.

The general approach I've used for something similar is to save your pandas table to a CSV, HDFS.put that on to the cluster, and then create a new table using that CSV as the data source.

You don't need Ibis for this, but it should make it a little bit easier and may be a nice tool for you if you're already familiar with pandas (Ibis was also created by Wes, who wrote pandas).

like image 101
Isaac Avatar answered Nov 05 '22 04:11

Isaac


I am trying to do same thing and I figured out a way to do this with an example provided with impyla:

df = pd.DataFrame(np.reshape(range(16), (4, 4)), columns=['a', 'b', 'c', 'd'])
df.to_sql(name=”test_df”, con=conn, flavor=”mysql”)

This works fine and table in impala (backend mysql) works fine.

However, I got stuck on getting text values in as impala tries to do analysis on columns and I get cast errors. (It would be really nice if possible to implicitly cast from string to [var]char(N) in impyla.)

like image 21
Sasha Kacanski Avatar answered Nov 05 '22 03:11

Sasha Kacanski