I have a simple text table (delimited by ",") with the following format:
orderID INT, CustID INT, OrderTotal FLOAT, OrderNumItems INT, OrderDesc STRING
I want to insert this data into a Parquet table. I've created the table using:
CREATE TABLE parquet_test (orderID INT, CustID INT, OrderTotal FLOAT,
OrderNumItems INT, OrderDesc STRING)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' stored as
INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat';
but when I try to insert the data using
insert overwrite table parquet_small_orders select * from small_orders;
it fails. Any thoughts?
Parquet is supported by a plugin in Hive 0.10, 0.11, and 0.12 and natively in Hive 0.13 and later.
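So on Hive 0.13 or later you don't need the SerDe/InputFormat/OutputFormat boilerplate at all; a minimal sketch of the native syntax, reusing the question's columns:
CREATE TABLE parquet_test (orderID INT, CustID INT, OrderTotal FLOAT,
OrderNumItems INT, OrderDesc STRING)
STORED AS PARQUET;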
I will share what I just tested, and it works perfectly fine. It might be a version-related problem that Hive has since solved, but I will explain what I did anyway.
1. Checking raw data
$ cat myFile.txt
orderID INT, CustID INT, OrderTotal FLOAT, OrderNumItems INT, OrderDesc STRING
100,1000,100000,5,"aa"
200,2000,200000,6,"bb"
300,3000,300000,7,"cc"
400,4000,400000,8,"dd"
500,5000,500000,9,"ee"
2. Mounting a Hive table pointing to the file
CREATE EXTERNAL TABLE myDB.orders (orderID INT, CustID INT, OrderTotal FLOAT, OrderNumItems INT, OrderDesc STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/my/path/to/csv'
TBLPROPERTIES ("skip.header.line.count"="1");
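Note that the external table only picks up files already sitting under that location; if the file starts out on the local filesystem, something like this stages it (using the placeholder paths from the DDL above):
# create the target directory in HDFS and copy the raw file into it
$ hdfs dfs -mkdir -p /my/path/to/csv
$ hdfs dfs -put myFile.txt /my/path/to/csv/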
3. Checking the Hive table works correctly. Note that I added a table property to skip the first line, which is often the header of the CSV file.
select * from myDB.orders;
100,1000,100000.0,5,"aa"
200,2000,200000.0,6,"bb"
300,3000,300000.0,7,"cc"
400,4000,400000.0,8,"dd"
500,5000,500000.0,9,"ee"
4. Mounting the Parquet table:
CREATE TABLE myDB.parquet_test (orderID INT, CustID INT, OrderTotal FLOAT,
OrderNumItems INT, OrderDesc STRING)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
LOCATION '/my/path/to/parquet';
5. Inserting data from the CSV Hive table into the Hive Parquet table (converting to Parquet in the process):
insert overwrite table myDB.parquet_test select * from myDB.orders;
Finally, I double-checked that the data transformation was correct by running a simple query against myDB.parquet_test, as shown below.
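Something as simple as this is enough (the exact query is up to you):
-- any simple read of the new table confirms the converted rows
select * from myDB.parquet_test;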
To be 100% sure the data is in Parquet, I went to /my/path/to/parquet and double-checked that those files are actually in Parquet format.
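If you want to make that check concrete, something like this works (a sketch; it assumes the parquet-tools CLI is available, and 000000_0 is an assumed output file name, Hive's usual convention but not guaranteed):
# list what Hive wrote into the table location
$ hdfs dfs -ls /my/path/to/parquet
# pull one output file down locally (000000_0 is an assumed name)
$ hdfs dfs -get /my/path/to/parquet/000000_0 /tmp/part.parquet
# parquet-tools prints the schema and first rows only if the file really is Parquet
$ parquet-tools schema /tmp/part.parquet
$ parquet-tools head /tmp/part.parquet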
I'm not sure whether your problem is that you aren't skipping the header of the file, or the version you're using, but again, the steps I just explained work well.