I have a simple text table (delimited by ",") with the following format:
orderID INT, CustID INT, OrderTotal FLOAT, OrderNumItems INT, OrderDesc STRING
I want to insert this data into a Parquet table. I've created the table using:
CREATE TABLE parquet_test (orderID INT, CustID INT, OrderTotal FLOAT,
OrderNumItems INT, OrderDesc STRING)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' stored as
INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat';
but when I try to insert the data using
insert overwrite table parquet_small_orders select * from small_orders;
it fails. Any thoughts?
Parquet is supported by a plugin in Hive 0.10, 0.11, and 0.12 and natively in Hive 0.13 and later.
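So on Hive 0.13 or later you don't need the SerDe/InputFormat/OutputFormat boilerplate at all; a minimal sketch of the native syntax, reusing the question's columns:
CREATE TABLE parquet_test (orderID INT, CustID INT, OrderTotal FLOAT,
OrderNumItems INT, OrderDesc STRING)
STORED AS PARQUET;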
I will share what I just tested, and it works perfectly fine. It might be a version-related problem that Hive has since solved, but I will explain what I did anyway.
1. Checking raw data
$ cat myFile.txt
orderID INT, CustID INT, OrderTotal FLOAT, OrderNumItems INT, OrderDesc STRING
100,1000,100000,5,"aa"
200,2000,200000,6,"bb"
300,3000,300000,7,"cc"
400,4000,400000,8,"dd"
500,5000,500000,9,"ee"
2. Mounting a Hive table pointing to the file
CREATE EXTERNAL TABLE myDB.orders (orderID INT, CustID INT, OrderTotal FLOAT, OrderNumItems INT, OrderDesc STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/my/path/to/csv'
TBLPROPERTIES ("skip.header.line.count"="1");
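Note that the external table only picks up files already sitting under that location; if the file starts out on the local filesystem, something like this stages it (using the placeholder paths from the DDL above):
# create the target directory in HDFS and copy the raw file into it
$ hdfs dfs -mkdir -p /my/path/to/csv
$ hdfs dfs -put myFile.txt /my/path/to/csv/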
3. Checking the Hive table works correctly. Note that I added a table property to skip the first line, which is often the header of the CSV file.
select * from myDB.orders;
100,1000,100000.0,5,"aa"
200,2000,200000.0,6,"bb"
300,3000,300000.0,7,"cc"
400,4000,400000.0,8,"dd"
500,5000,500000.0,9,"ee"
4. Mounting the Parquet table:
CREATE TABLE myDB.parquet_test (orderID INT, CustID INT, OrderTotal FLOAT,
OrderNumItems INT, OrderDesc STRING)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
LOCATION '/my/path/to/parquet';
5. Inserting data from the CSV Hive table into the Hive Parquet table (converting to Parquet in the process):
insert overwrite table myDB.parquet_test select * from myDB.orders;
Finally, I double-checked that the data transformation was correct by running a simple query against myDB.parquet_test, as shown below.
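Something as simple as this is enough (the exact query is up to you):
-- any simple read of the new table confirms the converted rows
select * from myDB.parquet_test;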
To be 100% sure the data is in Parquet, I went to /my/path/to/parquet and double-checked that those files are actually in Parquet format.
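If you want to make that check concrete, something like this works (a sketch; it assumes the parquet-tools CLI is available, and 000000_0 is an assumed output file name, Hive's usual convention but not guaranteed):
# list what Hive wrote into the table location
$ hdfs dfs -ls /my/path/to/parquet
# pull one output file down locally (000000_0 is an assumed name)
$ hdfs dfs -get /my/path/to/parquet/000000_0 /tmp/part.parquet
# parquet-tools prints the schema and first rows only if the file really is Parquet
$ parquet-tools schema /tmp/part.parquet
$ parquet-tools head /tmp/part.parquet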
I'm not sure whether your problem is that you aren't skipping the header of the file, or the version you're using, but again, the steps I just explained work well.