Can we load Parquet file into Hive directly?

I know we can load a Parquet file using Spark SQL and using Impala, but I'm wondering whether we can do the same using Hive. I have been reading many articles, but I am still confused.

Simply put, I have a Parquet file - say users.parquet. Now I am stuck on how to load/insert/import the data from users.parquet into Hive (obviously into a table).

Please advise or point me in the right direction if I am missing something obvious.

Creating hive table using parquet file metadata

https://phdata.io/examples-using-textfile-and-parquet-with-hive-and-impala/

asked Dec 16 '15 by annunarcist


3 Answers

Get the schema of the Parquet file using parquet-tools; for details, see http://kitesdk.org/docs/0.17.1/labs/4-using-parquet-tools-solution.html
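
For example, with the parquet-tools command-line utility (assuming it is installed, or available as a jar, on the machine that has the file; users.parquet is the file from the question):

    # print the schema embedded in the Parquet file
    parquet-tools schema users.parquet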

Then build the table on top of the file using that schema; for details, see Create Hive table to read parquet files from parquet/avro schema
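
A minimal sketch of that step, assuming the file sits in an HDFS directory such as /user/hive/parquet_users (the directory and the column list here are made up; take the real columns from the schema printed by parquet-tools):

    CREATE EXTERNAL TABLE users (
      id BIGINT,
      name STRING,
      email STRING
    )
    STORED AS PARQUET
    LOCATION '/user/hive/parquet_users';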

answered by Ram Manohar


Getting the schema is crucial, as you will have to create the table with the appropriate schema first in Hive and then point it to the parquet files.

I had a similar problem, where I had data in one VM and had to move it to another. Here is my walkthrough:

  1. Find out where the original Parquet files are and what their schema is: describe formatted users; and show create table users; The latter gives you the schema right away and also points you to the HDFS location, e.g. hdfs://hostname:port/apps/hive/warehouse/users

  2. Find out about the partitioning of your table: show partitions users;

  3. Copy the table's Parquet files from HDFS to a local directory

    hdfs dfs -copyToLocal /apps/hive/warehouse/users
    
  4. Move them across to the other cluster/VM or where you want them to go

  5. Create the users table on your destination (CREATE TABLE users ...) using the same schema

    CREATE TABLE users ( name string, ... )
    PARTITIONED BY (...)
    STORED AS PARQUET;
    
  6. Now move the Parquet files into the respective folder (if needed, find out the location of the table you've just created)

    hdfs dfs -ls /apps/hive/warehouse/users/
    hdfs dfs -copyFromLocal ../temp/* /apps/hive/warehouse/
    
  7. For each partition, you'll have to point Hive to the respective subdirectory: alter table users add partition (sign_up_date='19991231') location '/apps/hive/warehouse/users/sign_up_date=19991231'; (you might want to do this with a bash script; see the sketch after this list)
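
A rough sketch of such a script, assuming the partition directories already exist under the table's warehouse path and follow the sign_up_date=YYYYMMDD layout used above (hive -e is just one way to run the statement; beeline works too):

    #!/bin/bash
    # Register every sign_up_date=... subdirectory as a partition of the users table.
    TABLE_DIR=/apps/hive/warehouse/users
    for part in $(hdfs dfs -ls "$TABLE_DIR" | grep -o 'sign_up_date=[0-9]*'); do
      value=${part#sign_up_date=}
      hive -e "ALTER TABLE users ADD IF NOT EXISTS PARTITION (sign_up_date='${value}') LOCATION '${TABLE_DIR}/${part}';"
    done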

That worked for me, hope it helps.

answered by Hendrik F


It may be a bit "hacky", but I use Zeppelin (shipped with Ambari). You can simply do the following in combination with the spark2 interpreter:

%spark2
import org.apache.spark.sql.SaveMode

// read all the Parquet files, then write them back out as a Hive table
val df = spark.read.parquet("hdfs:///my_parquet_files/*.parquet")
df.write.mode(SaveMode.Overwrite).saveAsTable("imported_table")

The advantage of this approach is that you can also import many Parquet files at once, even if they do not all have the same schema.

answered by Fabian