
How to insert data into Parquet table in Hive

I have a simple text table (delimited by ",") with the following format:

orderID INT, CustID INT, OrderTotal FLOAT, OrderNumItems INT, OrderDesc STRING

I want to insert this data into a Parquet table. I've created the table using:

CREATE TABLE parquet_test (orderID INT, CustID INT, OrderTotal FLOAT, 
OrderNumItems INT, OrderDesc STRING) 
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' stored as 
INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat' 
OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat';

but when I try to insert the data using

insert overwrite table parquet_test select * from small_orders;

it fails. Any thoughts?

asked Apr 18 '14 by Alex L



1 Answer

I will share what I just tested, and it works fine. Your issue may be a version-related problem that newer Hive releases have already solved; in any case, here is exactly what I did:

1. Checking raw data

$ cat myFile.txt

orderID INT, CustID INT, OrderTotal FLOAT, OrderNumItems INT, OrderDesc STRING
100,1000,100000,5,"aa"
200,2000,200000,6,"bb"
300,3000,300000,7,"cc"
400,4000,400000,8,"dd"
500,5000,500000,9,"ee"
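For the external table in the next step to see this file, it has to sit in the HDFS directory that the table's LOCATION points to. Assuming the placeholder path from this example, staging it would look something like this:

$ hdfs dfs -mkdir -p /my/path/to/csv
$ hdfs dfs -put myFile.txt /my/path/to/csv/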

2. Creating an external Hive table pointing to the file

CREATE EXTERNAL TABLE myDB.orders (orderID INT, CustID INT, OrderTotal FLOAT, OrderNumItems INT, OrderDesc STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/my/path/to/csv'
TBLPROPERTIES ("skip.header.line.count"="1");
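As a quick sanity check (not part of the original steps), DESCRIBE FORMATTED shows the table's columns, storage format, and location in one shot:

describe formatted myDB.orders;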

3. Checking the Hive table works correctly. Note that in the DDL above I added a property to skip the first line, which is usually the header of the CSV file; without it, the header row would come back as data (with NULLs in the numeric columns, since the column names do not parse as numbers).

select * from myDB.orders;

100,1000,100000.0,5,"aa" 
200,2000,200000.0,6,"bb" 
300,3000,300000.0,7,"cc" 
400,4000,400000.0,8,"dd" 
500,5000,500000.0,9,"ee"

4. Creating the Parquet table:

CREATE TABLE myDB.parquet_test (orderID INT, CustID INT, OrderTotal FLOAT,
OrderNumItems INT, OrderDesc STRING)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
LOCATION '/my/path/to/parquet';
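As a side note: on Hive 0.13 and later, where Parquet support is native, the same table can be declared with the much shorter STORED AS PARQUET clause instead of spelling out the deprecated SerDe and input/output formats. A sketch of the equivalent DDL, assuming a Hive 0.13+ installation:

CREATE TABLE myDB.parquet_test (orderID INT, CustID INT, OrderTotal FLOAT,
OrderNumItems INT, OrderDesc STRING)
STORED AS PARQUET
LOCATION '/my/path/to/parquet';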

5. Inserting data from the CSV Hive table into the Parquet Hive table (converting to Parquet in the process)

insert overwrite table myDB.parquet_test select * from myDB.orders;

Finally, I double-checked that the data transformation was correct by running a simple query against myDB.parquet_test. To be 100% sure the data is in Parquet, I went to /my/path/to/parquet and verified that the files there are actually Parquet files. I'm not sure whether your problem is that you are not skipping the header of the file, or the Hive version you are using, but the steps I just explained work well.
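For reference, that double check boils down to something like the following; the parquet-tools call assumes you have that utility installed (it ships separately from Hive), and the file name is only illustrative:

select * from myDB.parquet_test;

$ hdfs dfs -ls /my/path/to/parquet
$ parquet-tools meta /my/path/to/parquet/000000_0   # file name is hypothetical; use whatever -ls lists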

answered Oct 08 '22 by dbustosp