Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Creating hive table using parquet file metadata

I wrote a DataFrame as parquet file. And, I would like to read the file using Hive using the metadata from parquet.

Output from writing parquet write

_common_metadata  part-r-00000-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet  part-r-00002-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet  _SUCCESS
_metadata         part-r-00001-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet  part-r-00003-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet

Hive table

CREATE  TABLE testhive
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  '/home/gz_files/result';



FAILED: SemanticException [Error 10043]: Either list of columns or a custom serializer should be specified

How can I infer the meta data from parquet file?

If I open the _common_metadata I have below content,

PAR1LHroot
%TSN%
%TS%
%Etype%
)org.apache.spark.sql.parquet.row.metadata▒{"type":"struct","fields":[{"name":"TSN","type":"string","nullable":true,"metadata":{}},{"name":"TS","type":"string","nullable":true,"metadata":{}},{"name":"Etype","type":"string","nullable":true,"metadata":{}}]}

Or how to parse meta data file?

like image 564
WoodChopper Avatar asked Nov 10 '15 08:11

WoodChopper


People also ask

Do Parquet files have metadata?

In addition to the data types, Parquet specification also stores metadata which records the schema at three levels; file, chunk(column) and page header. The footer for each file contains the file metadata.

How do you create a table in Parquet format?

To make the new table also use Parquet format, include the clause STORED AS PARQUET in the CREATE TABLE LIKE PARQUET statement. If the Parquet data file comes from an existing Impala table, currently, any TINYINT or SMALLINT columns are turned into INT columns in the new table.

Does Hive work with Parquet?

Parquet is supported by a plugin in Hive 0.10, 0.11, and 0.12 and natively in Hive 0.13 and later.


1 Answers

Here's a solution I've come up with to get the metadata from parquet files in order to create a Hive table.

First start a spark-shell (Or compile it all into a Jar and run it with spark-submit, but the shell is SOO much easier)

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.DataFrame


val df=sqlContext.parquetFile("/path/to/_common_metadata")

def creatingTableDDL(tableName:String, df:DataFrame): String={
  val cols = df.dtypes
  var ddl1 = "CREATE EXTERNAL TABLE "+tableName + " ("
  //looks at the datatypes and columns names and puts them into a string
  val colCreate = (for (c <-cols) yield(c._1+" "+c._2.replace("Type",""))).mkString(", ")
  ddl1 += colCreate + ") STORED AS PARQUET LOCATION '/wherever/you/store/the/data/'"
  ddl1
}

val test_tableDDL=creatingTableDDL("test_table",df,"test_db")

It will provide you with the datatypes that Hive will use for each column as they are stored in Parquet. E.G: CREATE EXTERNAL TABLE test_table (COL1 Decimal(38,10), COL2 String, COL3 Timestamp) STORED AS PARQUET LOCATION '/path/to/parquet/files'

like image 102
James Tobin Avatar answered Sep 21 '22 23:09

James Tobin