
Using CSV Serde with Hive create table converts all field types to string

If I create a table and specify a CSVSerde, all fields are converted to string type, regardless of the types I declare.

hive> create table foo(a int, b double, c string) row format serde 'com.bizo.hive.serde.csv.CSVSerde' stored as textfile;
OK
Time taken: 0.22 seconds
hive> describe foo;
OK
a    string    from deserializer
b    string    from deserializer
c    string    from deserializer
Time taken: 0.063 seconds, Fetched: 3 row(s)

That Serde is from https://github.com/ogrodnek/csv-serde

If I try the serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' from this page https://cwiki.apache.org/confluence/display/Hive/CSV+Serde I see the same thing: all fields are changed to type string.

Hive version 1.2.1
Hadoop version 2.7.0
java version "1.7.0_80"

awhan asked Jul 04 '15 13:07





1 Answer

Yes, the com.bizo.hive.serde.csv.CSVSerde only creates strings. This is how it was built and how it will always work. There is no option to change it. I think it is likely that strings would work for the majority of your variables.

That being said, I would use a SELECT statement with a regex-based column specification, which can be used in Hive releases prior to 0.13.0, or in 0.13.0 and later releases if the configuration property hive.support.quoted.identifiers is set to none. This means you can quickly build a new table, altering the types of only the few variables you need to have as doubles or ints.

set hive.support.quoted.identifiers=none;

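drop table if exists database.table_name;
create table if not exists database.table_name as
-- the regex selects every column except a, b and c;
-- those three are added back below with an explicit cast
select `(a|b|c)?+.+`
    , cast(a as double) as a
    , cast(b as double) as b
    , cast(c as double) as c
from database.some_table;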

You can use this method to touch only the variables that need to be changed while minimizing query length. You can also create a view on top of your table to query it this way, or you could create an external table and drop the old table.
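
As a rough sketch of the view option, assuming the table foo from the question (declared as a int, b double, c string) and a made-up view name foo_typed, something like the following should expose typed columns without copying the data. The view name is purely illustrative, and it is an assumption that the data in a and b parses cleanly as numbers.

```sql
-- foo_typed is a hypothetical name; foo is the CSV-backed table from the
-- question, whose columns are all reported as string by the serde.
create view if not exists foo_typed as
select cast(a as int)    as a
     , cast(b as double) as b
     , c                 -- already a string, so no cast is needed
from foo;

-- Inspect the column types exposed by the view.
describe foo_typed;
```

Values that cannot be parsed as the target type come back as NULL from cast, so it may be worth checking for unexpected NULLs after switching queries to the view.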

invoketheshell answered Oct 10 '22 00:10
