I know we can load Parquet files using Spark SQL and Impala, but I am wondering if we can do the same using Hive. I have been reading many articles, but I am still confused.
Simply put, I have a Parquet file, say users.parquet. Now I am stuck on how to load/insert/import the data from users.parquet into Hive (obviously into a table).
Please advise or point me in the right direction if I am missing something obvious.
Creating hive table using parquet file metadata
https://phdata.io/examples-using-textfile-and-parquet-with-hive-and-impala/
Get the schema of the Parquet file using parquet-tools; for details, check http://kitesdk.org/docs/0.17.1/labs/4-using-parquet-tools-solution.html
Then build a table using that schema on top of the file; for details, check Create Hive table to read parquet files from parquet/avro schema.
Getting the schema is crucial, as you will have to create the table with the appropriate schema first in Hive and then point it at the Parquet files, as in the sketch below.
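A minimal sketch of those two steps, assuming parquet-tools schema users.parquet reported just name (string) and age (int), and that users.parquet sits at /tmp/users.parquet on HDFS (both the columns and the path are made-up placeholders):

CREATE TABLE users (name STRING, age INT)
STORED AS PARQUET;

-- moves users.parquet into the table's warehouse directory
LOAD DATA INPATH '/tmp/users.parquet' INTO TABLE users;

Alternatively, declare the table as EXTERNAL with a LOCATION clause pointing at the HDFS directory that already holds the Parquet files, so nothing has to be moved at all.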
I had a similar problem, where I had data in one VM and had to move it to another. Here is my walkthrough:
Find out where the original Parquet files are (location and schema):
describe formatted users;
and show create table users;
The latter will give you the schema right away and also point you to the HDFS location, e.g. hdfs://hostname:port/apps/hive/warehouse/users
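For illustration only (the columns and paths here are placeholders, and the real output also spells out the SerDe and input/output format details), show create table users returns something along these lines:

CREATE TABLE `users`(
  `name` string)
PARTITIONED BY (
  `sign_up_date` string)
STORED AS PARQUET
LOCATION
  'hdfs://hostname:port/apps/hive/warehouse/users'

which gives you both the column definitions to recreate the table and the directory to copy the files from.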
Find out about the partitioning of your table: show partitions users;
Copy the table's Parquet files from HDFS to a local directory
hdfs dfs -copyToLocal /apps/hive/warehouse/users
Move them across to the other cluster/VM or where you want them to go
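A minimal sketch of that step, assuming copyToLocal left a local ./users directory and the destination VM is reachable as user@othervm (both placeholders):

scp -r ./users user@othervm:/tmp/users

If both sides are HDFS clusters that can reach each other, hadoop distcp between the two hdfs:// URIs is a commonly used alternative that skips the local copy entirely.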
Create the users table on your destination (CREATE TABLE users ...), using the same schema:
CREATE TABLE users ( name string, ... )
PARTITIONED BY (...)
STORED AS PARQUET;
Now, move the Parquet files into the respective folder (if needed, find out the location of the table you've just created first):
hdfs dfs -ls /apps/hive/warehouse/users/
hdfs dfs -copyFromLocal ../temp/* /apps/hive/warehouse/
For each partition, you'll have to point Hive to the respective subdirectory: alter table users add partition (sign_up_date='19991231') location '/apps/hive/warehouse/users/sign_up_date=19991231';
(you might want to do this with a bash script; see the sketch below)
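A minimal bash sketch of that loop, assuming the hive CLI is available and the partition directories follow the sign_up_date=... layout shown above (adjust names and paths to yours):

#!/bin/bash
# Register every sign_up_date=... subdirectory as a partition of the users table.
for dir in $(hdfs dfs -ls /apps/hive/warehouse/users/ | grep 'sign_up_date=' | awk '{print $NF}'); do
  part="${dir##*sign_up_date=}"   # e.g. 19991231
  hive -e "ALTER TABLE users ADD IF NOT EXISTS PARTITION (sign_up_date='${part}') LOCATION '${dir}';"
done

If the partition directories already sit under the table's own location, MSCK REPAIR TABLE users; is a simpler way to have Hive discover all of them in one go.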
That worked for me, hope it helps.
I don't know if it's a bit "hacky", but I use Zeppelin (shipped with Ambari). You can simply do the following in combination with Spark 2:
%spark2
import org.apache.spark.sql.SaveMode
// read all Parquet files in the directory into a DataFrame
val df = spark.read.parquet("hdfs:///my_parquet_files/*.parquet")
// write the DataFrame out as a Hive table
df.write.mode(SaveMode.Overwrite).saveAsTable("imported_table")
The advantage of this approach is that you can also import many Parquet files at once, even if their schemas differ somewhat.
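Note that by default Spark just takes the schema from one of the files; to actually reconcile files whose schemas differ (but are compatible), enable schema merging. A minimal sketch, reusing the same path and table name as above:

%spark2
import org.apache.spark.sql.SaveMode

// mergeSchema makes Spark union the columns found across all Parquet files
val df = spark.read
  .option("mergeSchema", "true")
  .parquet("hdfs:///my_parquet_files/*.parquet")

df.write.mode(SaveMode.Overwrite).saveAsTable("imported_table")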