Using the hive or beeline client, I have no problem executing this statement:
hive -e "LOAD DATA LOCAL INPATH '/tmp/tmpBKe_Mc' INTO TABLE unit_test_hs2"
The data from the file is loaded successfully into Hive.
However, when using pyhs2 from the same machine, the file is not found:
import pyhs2

conn_str = {'authMechanism': 'NOSASL', 'host': 'azus'}
conn = pyhs2.connect(**conn_str)  # unpack the dict as keyword arguments
with conn.cursor() as cur:
    cur.execute("LOAD DATA LOCAL INPATH '/tmp/tmpBKe_Mc' INTO TABLE unit_test_hs2")
This raises an exception:
Traceback (most recent call last):
File "data_access/hs2.py", line 38, in write
cur.execute("LOAD DATA LOCAL INPATH '%s' INTO TABLE %s" % (csv_file.name, table_name))
File "/edge/1/anaconda/lib/python2.7/site-packages/pyhs2/cursor.py", line 63, in execute
raise Pyhs2Exception(res.status.errorCode, res.status.errorMessage)
pyhs2.error.Pyhs2Exception: "Error while compiling statement: FAILED: SemanticException Line 1:23 Invalid path ''/tmp/tmpBKe_Mc'': No files matching path file:/tmp/tmpBKe_Mc"
I've seen similar questions posted about this problem, and the usual answer is that the query is running on a different server that doesn't have the local file '/tmp/tmpBKe_Mc' stored on it. However, if that is the case, why would running the command directly from the CLI work, but running it through pyhs2 fail?
(Secondary question: how can I tell which server is handling the query? I've tried cur.execute("set"), which returns all configuration parameters, but when grepping the output for "host", none of the returned parameters seem to contain a real hostname.)
Thanks!
This happens because pyhs2 submits the statement to HiveServer2, so LOCAL INPATH is resolved against the filesystem of the HiveServer2 host on the cluster, not the machine running your script. The hive CLI, by contrast, runs the Hive driver locally, which is why the same statement works there.
The solution is to put your source file in an appropriate HDFS location instead of the server-local /tmp, and load it with LOAD DATA INPATH (without the LOCAL keyword).