
Is there a way to load CSV data into a "binary" Hive format?

Tags:

hive

I am wondering if there is any way to actually load CSV data into a binary Hive format - i.e. doing the same thing that data loading in a relational database would do: parsing and type-converting the input and storing it in a binary format (in another binary file, in the case of Hive). The Hive reference says that the load data inpath command does not do "any transformation", so I suspect that types are not converted, e.g., from string to integer. I was reading about the ORC and RCFile formats, but I could not find out whether, e.g., string values from the CSV are converted into machine integer values and stored in HDFS. Is that the case? What other possibilities are there to create binary representations of CSV files in Hive?

On a related note: I suspect Hive does convert string values into machine representations during query processing rather than, e.g., comparing string values directly - is this assumption right?

asked May 06 '13 by muehlbau

1 Answer

By default, Hive just stores files as plain text and stores records as plain text, all uncompressed. It uses ASCII 0x1 as the field separator, which is more convenient than a comma for some inputs, but I'm sure you've worked out how to get Hive to work with comma-separated values. If you want Hive to use a different file format, serialize/deserialize differently, or compress the data, you have a few different options to play around with.
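For illustration, the default text layout can be declared explicitly like this (a minimal sketch; the table and column names are placeholders, and '\001' is the ASCII 0x1 separator mentioned above):

-- plain_table is a hypothetical example table
CREATE TABLE plain_table (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE;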

Out of the box, Hive supports several different file formats: TEXTFILE, SEQUENCEFILE, and RCFILE. The differences between them have to do with how files are read, split, and written. TEXTFILE is the default and operates on normal text files. SEQUENCEFILE is a binary key-value pair format which is easily consumed by other parts of the Hadoop ecosystem. And RCFILE is a column-oriented way to save Hive tables. In addition to these file formats, you can write your own or find ones other people have written to meet different needs.
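The file format is chosen when a table is created, via a STORED AS clause. A minimal sketch (table and column names are placeholders):

-- rc_table is a hypothetical example table stored in the RCFile format
CREATE TABLE rc_table (id INT, name STRING)
STORED AS RCFILE;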

In addition to the file format your data is saved in, you can decide how records in a table should be serialized and deserialized by specifying a SerDe. Hive 0.9.1 and above comes packaged with an AvroSerDe, and Avro saves data in a binary format (it also has a schema itself, which introduces some complications). A Google search for "hive binary SerDe" turns up a LazyBinarySerDe, which sounds like a more straightforward way of saving data in a binary format. And if you can't find anything to fit your needs, you can always write your own SerDe.
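For example, an Avro-backed table declaration typically looks something like the sketch below; the schema location is a placeholder, and the exact class names and table properties can vary with your Hive version, so check the documentation for the release you are on:

-- the columns come from the Avro schema, so no column list is declared here
CREATE TABLE avro_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
-- the schema URL below is a placeholder path
TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/avro_table.avsc');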

I imagine your question fits into the larger context of how to make Hive tables smaller and/or more performant. To this end, you can apply compression on top of everything I have mentioned above. To accomplish this, simply tell Hive to compress its output and tell it which codec to compress with, using the following commands:

hive> set hive.exec.compress.output=true;
hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

You can change this in your config files if you want these settings to persist outside the session (including other people's Hive and MapReduce jobs if you are sharing a cluster). I use SnappyCodec because it works with Hive out of the box, is splittable, and gives good compression/decompression for the CPU time spent. You might decide a different codec is more suitable to your needs.
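If you do want the settings to persist, the equivalent hive-site.xml entries would look roughly like this (property names can vary between Hive/Hadoop versions, so treat this as a sketch):

<!-- hive-site.xml: persist the compression settings shown above -->
<property>
  <name>hive.exec.compress.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>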

Now how do you apply all these options if all your data is in CSV format? The easiest way is to create a table on top of the CSV files, then create another table with the file format and SerDe you want, then insert the data from the CSV-backed table into the new table (making sure that you are compressing your Hive output with your codec of choice). Under the hood, Hive will take care of reading the data from one format (CSV) and writing it to another (whatever you decided). After this you will have a duplicate of the data, and you can drop the CSV files if you desire.

CREATE EXTERNAL TABLE csv_table (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/muehlbau/yourData';

CREATE TABLE binary_table (id INT, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe'
STORED AS SEQUENCEFILE;

set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

INSERT OVERWRITE TABLE binary_table
SELECT * FROM csv_table;

The example above demonstrates how you could take advantage of all the options available to you, but do not treat it as the default, reasonable choice for every use case. Read up on the different file formats / SerDes / compression codecs and do some performance testing to settle on your approach.

answered Oct 01 '22 by Daniel Koverman