
Incorrect query encoding when executing Hive from file

I have a Hive query containing CJK characters in a file, like this:

SELECT * FROM tbl WHERE name LIKE '日本語%';

And the file is encoded in UTF-8:

> file -bi query.hql
text/plain; charset=utf-8

If I execute it with the Hive CLI, I get the expected result:

> /path/to/hive -f query.hql
some results here

Now I want to execute this query from Java, so I wrote code like this:

String[] cmd = new String[]{"/bin/bash", "/my/script", "/path/to/query.hql", "/path/to/output.txt"};
ProcessBuilder pb = new ProcessBuilder(cmd);
...
pb.start();
...

/my/script looks like this:

HQL_FILE=$1
OUTPUT_FILE=$2
/path/to/hive -f "${HQL_FILE}" > "${OUTPUT_FILE}"

I ran my Java program but got no output. I checked the Hive log file, and it looks like an encoding issue.

If I run hive -f query.hql from the shell, the CJK text is logged correctly in the Hive log:

> cat /tmp/myuser/hive.log
2016-02-29 11:27:40,303 INFO  [main]: parse.ParseDriver (ParseDriver.java:parse(185)) - Parsing command: ... name LIKE '日本語%' ...

But if I run it via the Java program above, the CJK text in the log is replaced with question marks:

> cat /tmp/myuser/hive.log
2016-02-29 11:29:41,104 INFO  [main]: parse.ParseDriver (ParseDriver.java:parse(185)) - Parsing command: ... name LIKE '???????%' ...

I've been investigating this problem for half a day but could not find any useful information.

I would appreciate any advice.

PS:

  1. Hive Server is not an option. I have to invoke the Hive client via the shell.
  2. I'm using Hive 0.14.0.
asked Mar 23 '26 04:03 by kuang


1 Answer

Assuming that the Java program isn't writing the hql file itself, in the shell where the hive command works, run this command:

echo $LANG

You'll probably get something like en_US.UTF-8.

Take whatever value you get and modify your Java program to have this after you create the ProcessBuilder:

pb.environment().put("LANG", "en_US.UTF-8");

(Use whatever value you got instead of en_US.UTF-8)
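
A minimal sketch of that change, assuming the script path and arguments from the question. The class name HiveLauncher and the build helper are hypothetical, and en_US.UTF-8 stands in for whatever echo $LANG printed. A process started through ProcessBuilder inherits the environment of the Java process, so if Java itself was started without LANG (from a service or cron job, for example), the Hive client likely falls back to a non-UTF-8 default encoding:

```java
import java.io.IOException;
import java.util.Map;

public class HiveLauncher {

    // Hypothetical helper: builds the command from the question and
    // pins LANG in the environment that the child process will see.
    static ProcessBuilder build(String script, String hqlFile, String outputFile, String lang) {
        ProcessBuilder pb = new ProcessBuilder("/bin/bash", script, hqlFile, outputFile);
        Map<String, String> env = pb.environment(); // copy of the parent's environment
        env.put("LANG", lang);                      // e.g. "en_US.UTF-8"
        return pb;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Shows what the Java process itself inherited, for comparison with `echo $LANG`.
        System.out.println("LANG seen by Java: " + System.getenv("LANG"));

        ProcessBuilder pb = build("/my/script", "/path/to/query.hql",
                                  "/path/to/output.txt", "en_US.UTF-8");
        pb.inheritIO(); // let the script's stderr reach this program's console
        int exitCode = pb.start().waitFor();
        System.out.println("script exited with " + exitCode);
    }
}
```

If LC_ALL is set in the inherited environment, it takes precedence over LANG, so it may need the same treatment.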

If your Java program is writing the hql file itself, then there's something else to worry about too: when you open the file, you should specify UTF-8 encoding for output. How to do that will depend a bit on how you're opening the file.
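
If the Java program does generate the query file, one way to force UTF-8 is to encode the text explicitly rather than rely on the platform default. This is a sketch only: HqlFileWriter and writeUtf8 are hypothetical names, and it writes to a temporary file rather than the path in the question.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class HqlFileWriter {

    // Hypothetical helper: encodes the query as UTF-8 bytes and writes them,
    // so the result does not depend on the JVM's default charset.
    static void writeUtf8(Path target, String query) throws IOException {
        Files.write(target, query.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("query", ".hql");
        // Unicode escapes for the CJK prefix used in the question, so this
        // source file compiles the same way regardless of its own encoding.
        String query = "SELECT * FROM tbl WHERE name LIKE '\u65e5\u672c\u8a9e%';\n";
        writeUtf8(target, query);
        System.out.println("wrote " + target);
    }
}
```

Running file -bi on the generated file, as in the question, is a quick way to confirm the encoding.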

answered Mar 24 '26 18:03 by Daniel Martin


