
Custom SerDe not supported by Impala, what's the best way to query files in CSV w/double quotes?

I have CSV data in which each field is surrounded by double quotes. When I created the Hive table, I used the SerDe 'com.bizo.hive.serde.csv.CSVSerde'. When I query that table in Impala, I get the error "SerDe not found".

I added the CSV SerDe JAR file to the /usr/lib/impala/lib folder.

I later read in the Impala documentation that Impala does not support custom SerDes. How can I work around this so that my quoted CSV data is handled correctly? I want to use the CSV SerDe because it takes care of commas inside values, which are a legitimate part of a field value.

Thanks a lot

prasannads asked Sep 03 '14 10:09


2 Answers

Can you use Hive? If so, here is an approach that might work. Create your table as an EXTERNAL TABLE in Hive and specify your SerDe in the CREATE TABLE statement (I think you need something like ROW FORMAT SERDE 'your_serde_here' after the column definitions, before any STORED AS or LOCATION clause). Before this you might need to do:

ADD JAR hdfs:///path/to/your_serde.jar;

Note that the JAR should be somewhere in HDFS, and the triple slash (///) after hdfs: is needed for this to work.
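
For illustration, the full CREATE statement might look roughly like the sketch below. The table name, column names, and LOCATION path are made up; only the SerDe class name comes from the question. As far as I know this SerDe exposes every column as a string, so the columns are declared as STRING here.

```sql
-- Run in the same Hive session, after the ADD JAR statement.
-- Hypothetical table, columns, and HDFS directory; adjust to your data.
CREATE EXTERNAL TABLE your_original_table (
  id   STRING,
  name STRING,
  note STRING   -- may contain commas inside the double quotes
)
ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
STORED AS TEXTFILE
LOCATION '/path/to/csv_directory';
```

If you need numeric or date types, one option is to CAST the string columns when copying the data into the second table.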

Then, still in Hive, duplicate the table into another table that is stored in a format with which Impala can easily work, such as PARQUET. Something like the following does this copying:

CREATE TABLE copy_of_table
   STORED AS PARQUET AS
   SELECT * FROM your_original_table;

Now, in Impala, use INVALIDATE METADATA to mark the metadata as stale, so that Impala picks up the table created in Hive:

INVALIDATE METADATA copy_of_table;

You should be all set to happily work with copy_of_table in Impala now.

Let me know whether this works, as I might have to do something like this in the near future.

Mateo answered Sep 28 '22 06:09


Within Hive

CREATE TABLE mydb.my_serde_table_impala AS SELECT * FROM mydb.my_serde_table;

Within Impala

INVALIDATE METADATA mydb.my_serde_table_impala;

Add these steps, preceded by dropping the _impala table, to whatever process populates or ingests files for the SerDe table, so that the copy is rebuilt whenever new data arrives.
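
Put together, the refresh could look something like this sketch, which reuses the table names from this answer. DROP TABLE IF EXISTS is my addition, so that the first run does not fail when the copy does not exist yet.

```sql
-- Within Hive: rebuild the Impala-readable copy from the SerDe table.
DROP TABLE IF EXISTS mydb.my_serde_table_impala;
CREATE TABLE mydb.my_serde_table_impala AS
  SELECT * FROM mydb.my_serde_table;

-- Within Impala: reload the metadata for the recreated table.
INVALIDATE METADATA mydb.my_serde_table_impala;
```

The Hive session that runs the first part still needs the SerDe JAR available, since it has to read mydb.my_serde_table.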

Impala bypasses MapReduce, unlike Hive, so Impala can't use the SerDe the way Hive's MapReduce jobs do.

hrobertv answered Sep 28 '22 04:09