Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python/Hive interface slow with fetchone(), hangs with fetchall()

Tags:

python

hive

I have a python script that is querying HiveServer2 using pyhs2, like so:

import pyhs2;
conn = pyhs2.connect(host=localhost,
               port=10000,
               user='user',
               password='password',
               database='default');

cur = conn.cursor();
cur.execute("SELECT name,data,number,time FROM table WHERE date = '2014-01-01' AND number in (1,5,6,22) ORDER BY name,time ASC");

line = cur.fetchone();
while line is not None:
   <do some processing, including writing to stdout>
   .
   .
   .
   line = cur.fetchone();

I have also tried using fetchall() instead of fetchone(), but that just seems to hang forever.

My query runs just fine and returns ~270 million rows. For testing, I dumped the output from Hive into a flat, tab-delimited file and wrote the guts of my python script against that, so I didn't have to wait for the query to finish everytime I ran. My script that reads the flat file will finish in ~20 minutes. What confuses me is that I don't see that same performance when I directly query Hive. In fact, it takes about 5 times longer to finish processing. I am pretty new to Hive, and python so maybe I am making some bone-headed error, but examples that I see online show a set up such as this. I just want to iterate through my Hive return, getting one row at a time as quickly as possible, much like I did using my flat file. Any suggestions?

P.S. I have found this question that sounds similar:

Python slow on fetchone, hangs on fetchall

but that ended up being a SQLite issue, and I have no control over my Hive set up.

like image 538
Jeff Hall Avatar asked Aug 11 '14 16:08

Jeff Hall


1 Answers

Have you considered using fetchmany().

That would be the DBAPI answer for pulling data in chunks (bigger one, where the overhead is an issue, and smaller than all rows, where memory is an issue).

like image 58
yacc143 Avatar answered Oct 26 '22 23:10

yacc143