Creating hive table using parquet file metadata

Tags:

I wrote a DataFrame as parquet file. And, I would like to read the file using Hive using the metadata from parquet.

Output from writing parquet write

_common_metadata  part-r-00000-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet  part-r-00002-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet  _SUCCESS
_metadata         part-r-00001-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet  part-r-00003-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet

Hive table

CREATE  TABLE testhive
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  '/home/gz_files/result';



FAILED: SemanticException [Error 10043]: Either list of columns or a custom serializer should be specified

How can I infer the meta data from parquet file?

If I open the _common_metadata I have below content,

PAR1LHroot
%TSN%
%TS%
%Etype%
)org.apache.spark.sql.parquet.row.metadata▒{"type":"struct","fields":[{"name":"TSN","type":"string","nullable":true,"metadata":{}},{"name":"TS","type":"string","nullable":true,"metadata":{}},{"name":"Etype","type":"string","nullable":true,"metadata":{}}]}

Or how to parse meta data file?

564

asked Nov 10 '15 08:11

WoodChopper

1 Answers

Here's a solution I've come up with to get the metadata from parquet files in order to create a Hive table.

First start a spark-shell (Or compile it all into a Jar and run it with spark-submit, but the shell is SOO much easier)

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.DataFrame


val df=sqlContext.parquetFile("/path/to/_common_metadata")

def creatingTableDDL(tableName:String, df:DataFrame): String={
  val cols = df.dtypes
  var ddl1 = "CREATE EXTERNAL TABLE "+tableName + " ("
  //looks at the datatypes and columns names and puts them into a string
  val colCreate = (for (c <-cols) yield(c._1+" "+c._2.replace("Type",""))).mkString(", ")
  ddl1 += colCreate + ") STORED AS PARQUET LOCATION '/wherever/you/store/the/data/'"
  ddl1
}

val test_tableDDL=creatingTableDDL("test_table",df,"test_db")

It will provide you with the datatypes that Hive will use for each column as they are stored in Parquet. E.G: CREATE EXTERNAL TABLE test_table (COL1 Decimal(38,10), COL2 String, COL3 Timestamp) STORED AS PARQUET LOCATION '/path/to/parquet/files'

102

answered Sep 21 '22 23:09

James Tobin

Related questions
                            
                                Spray/Akka missing implicit
                            
                                filter a List according to multiple contains
                            
                                Flatten a list of tuples in Scala?
                            
                                Error: not found: value lit/when - spark scala
                            
                                Spark : Average of values instead of sum in reduceByKey using Scala
                            
                                Scala: how to inherit a "static slot"?
                            
                                Implementing a Measured value in Scala
                            
                                Adding two Set[Any]
                            
                                What are the performance and maintenance considerations of using Scala's structural types?
                            
                                Backquote Used in Scala Swing Event
                            
                                How would I represent a compound key in Scala?
                            
                                What does _ :: mean in Scala?
                            
                                how to avoid race conditions with scala parallel collections
                            
                                Scala, Extend object with a generic trait
                            
                                Local dependencies resolved by SBT but not by Play! Framework
                            
                                Scala: How can I split a String into a Map
                            
                                Scala: Does Option[Boolean] makes sense?
                            
                                Scala: how can I generate numbers according to an expected distribution?
                            
                                Writing to HBase via Spark: Task not serializable
                            
                                Scala break out of map

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Creating hive table using parquet file metadata

Tags:

scala

apache-spark

hive

parquet

WoodChopper

People also ask

1 Answers

James Tobin

Recent Activity

Donate For Us