How do I output the results of a HiveQL query to CSV?

We would like to put the results of a Hive query into a CSV file. I thought the command should look like this:

insert overwrite directory '/home/output.csv' select books from table; 

When I run it, it says it completed successfully, but I can never find the file. How do I find this file, or should I be extracting the data in a different way?

asked Aug 08 '13 by AAA


1 Answer

Although it is possible to use INSERT OVERWRITE to get data out of Hive, it might not be the best method for your particular case. First let me explain what INSERT OVERWRITE does, then I'll describe the method I use to get tsv files from Hive tables.

According to the manual, your query will store the data in a directory in HDFS. The format will not be CSV.

Data written to the filesystem is serialized as text with columns separated by ^A and rows separated by newlines. If any of the columns are not of primitive type, then those columns are serialized to JSON format.

A slight modification (adding the LOCAL keyword) will store the data in a local directory.

INSERT OVERWRITE LOCAL DIRECTORY '/home/lvermeer/temp' select books from table; 

When I run a similar query, here's what the output looks like:

[lvermeer@hadoop temp]$ ll
total 4
-rwxr-xr-x 1 lvermeer users 811 Aug  9 09:21 000000_0
[lvermeer@hadoop temp]$ head 000000_0
"row1""col1"1234"col3"1234FALSE
"row2""col1"5678"col3"5678TRUE
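Since the fields in that file are separated by the non-printing ^A character, one way to get a CSV out of it is a quick tr pass. This is only a sketch: the printf fakes a small Hive output file, and it assumes the column values themselves contain no commas or ^A bytes.

```shell
# Hive writes directory output with ^A (octal \001) between columns.
# Fake one such output file with printf, then swap ^A for commas;
# on a real cluster the input would be the 000000_0 file Hive wrote.
printf 'row1\001col1\0011234\n' > /tmp/000000_0
tr '\001' ',' < /tmp/000000_0 > /tmp/out.csv
cat /tmp/out.csv   # row1,col1,1234
```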

Personally, I usually run my query directly through Hive on the command line for this kind of thing, and pipe it into the local file like so:

hive -e 'select books from table' > /home/lvermeer/temp.tsv 

That gives me a tab-separated file that I can use. Hope that is useful for you as well.
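If you want commas rather than tabs, you can convert them on the way out. Again just a sketch: the printf stands in for the real `hive -e 'select books from table'` call, and it assumes the column values contain no literal tabs or commas.

```shell
# Simulate the tab-separated rows `hive -e` would emit, then turn the
# tabs into commas; with a real cluster, replace the printf with:
#   hive -e 'select books from table'
printf 'book1\t12\nbook2\t34\n' | tr '\t' ',' > /tmp/temp.csv
cat /tmp/temp.csv
```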

Based on this patch (HIVE-3682), I suspect a better solution is available when using Hive 0.11, but I am unable to test this myself. The new syntax should allow the following.

INSERT OVERWRITE LOCAL DIRECTORY '/home/lvermeer/temp'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select books from table;
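One thing to watch with the directory approaches: Hive writes one file per reducer, so the output directory may hold several files (000000_0, 000001_0, and so on). A sketch of stitching them back into a single CSV, using small stand-in files instead of real Hive output:

```shell
# Hive can leave one file per reducer in the output directory;
# concatenate them in order to get a single CSV. The two printf
# calls stand in for files a real query would have written.
mkdir -p /tmp/hive_out
printf 'a,1\n' > /tmp/hive_out/000000_0
printf 'b,2\n' > /tmp/hive_out/000001_0
cat /tmp/hive_out/00000*_0 > /tmp/books.csv
cat /tmp/books.csv
```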

Hope that helps.

answered Oct 09 '22 by Lukas Vermeer