
Save Spark dataframe as dynamic partitioned table in Hive

I have a sample application that reads CSV files into a DataFrame. The DataFrame can be stored in a Hive table in Parquet format using the method df.saveAsTable(tablename, mode).

The above code works fine, but I have so much data for each day that I want to dynamically partition the Hive table based on creationdate (a column in the table).

Is there any way to dynamically partition the DataFrame and store it in the Hive warehouse? I want to avoid hard-coding the insert statement using hivesqlcontext.sql(insert into table partition by(date)....).

This question can be considered an extension of: How to save DataFrame directly to Hive?

Any help is much appreciated.

Chetandalal asked Jul 10 '15 13:07



People also ask

Can we create dynamic partitions in Hive using Spark?

From the Writing Into Dynamic Partitions Using Spark: Spark now writes data partitioned just as Hive would — which means only the partitions that are touched by the INSERT query get overwritten and the others are not touched.

How do I create a partition table in Pyspark?

Partition in memory: You can partition or repartition the DataFrame by calling the repartition() or coalesce() transformations. Partition on disk: While writing the PySpark DataFrame back to disk, you can choose how to partition the data based on columns using partitionBy() of pyspark.sql.DataFrameWriter.
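To make the "partition on disk" idea concrete, here is a minimal plain-Python sketch (no Spark required) of the Hive-style directory layout that partitionBy() produces: one `col=value` subdirectory per distinct combination of partition values, with the partition columns stored in the path rather than in the data files. The function name `partition_paths` and the sample rows are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def partition_paths(rows, partition_cols):
    """Group rows by their partition values, keyed by Hive-style relative path."""
    layout = defaultdict(list)
    for row in rows:
        # Build a path such as "year=2015/month=7" from the partition columns.
        subdir = "/".join(f"{col}={row[col]}" for col in partition_cols)
        # Partition columns are encoded in the path, so drop them from the data.
        data = {k: v for k, v in row.items() if k not in partition_cols}
        layout[subdir].append(data)
    return dict(layout)


# Hypothetical sample data with year and month columns.
rows = [
    {"id": 1, "year": 2015, "month": 7},
    {"id": 2, "year": 2015, "month": 7},
    {"id": 3, "year": 2015, "month": 8},
]

layout = partition_paths(rows, ["year", "month"])
for subdir in sorted(layout):
    print(subdir, layout[subdir])
```

Rows that share the same year and month end up in the same subdirectory, which is why a query filtering on those columns only needs to read the matching directories.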


2 Answers

I believe it works something like this:

df is a dataframe with year, month and other columns

df.write.partitionBy('year', 'month').saveAsTable(...)

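or, to append to a partitioned table that already exists:

df.write.partitionBy('year', 'month').insertInto(...)

(On Spark 2.0 and later, drop partitionBy() in this second form: insertInto() takes the partition columns from the existing table definition and raises an AnalysisException if partitionBy() is also given.)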
mdurant answered Oct 08 '22 02:10



I was able to write to a partitioned Hive table using df.write().mode(SaveMode.Append).partitionBy("colname").saveAsTable("Table").

I had to enable the following properties to make it work:

hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
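Putting the two properties and the write call together, a minimal PySpark-style sketch might look like the following. The helper names `enable_dynamic_partitioning` and `append_partitioned` and the table name "events" are hypothetical; the sketch assumes a HiveContext-like object exposing setConf() and a DataFrame whose writer supports mode(), partitionBy() and saveAsTable(). On newer Spark versions a SparkSession built with enableHiveSupport() and spark.conf.set(...) would probably play the role of the HiveContext.

```python
def enable_dynamic_partitioning(hive_context):
    """Set the two Hive properties that allow fully dynamic partition inserts."""
    hive_context.setConf("hive.exec.dynamic.partition", "true")
    hive_context.setConf("hive.exec.dynamic.partition.mode", "nonstrict")


def append_partitioned(df, table_name, partition_cols):
    """Append df to table_name, partitioned by the given columns."""
    (df.write
       .mode("append")                 # add to the table instead of replacing it
       .partitionBy(*partition_cols)   # one partition per distinct value combination
       .saveAsTable(table_name))


# Intended usage, assuming hiveContext and df already exist:
#   enable_dynamic_partitioning(hiveContext)
#   append_partitioned(df, "events", ["creationdate"])
```

Keeping the configuration and the write in separate functions means the properties are set once per session, while the write helper can be reused for each daily batch.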
Jins George answered Oct 08 '22 03:10
