I have a sample application that reads CSV files into a dataframe. The dataframe can be stored to a Hive table in parquet format using the method
df.saveAsTable(tablename, mode)
The above code works fine, but I have so much data for each day that I want to dynamically partition the Hive table based on creationdate (a column in the table).
Is there any way to dynamically partition the dataframe and store it in the Hive warehouse? I want to refrain from hard-coding the insert statement using hivesqlcontext.sql(insert into table partition by(date)....).
This question can be considered an extension of: How to save DataFrame directly to Hive?
Any help is much appreciated.
From Writing Into Dynamic Partitions Using Spark: Spark now writes data partitioned just as Hive would, which means only the partitions that are touched by the INSERT query get overwritten and the others are not touched.
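As a rough illustration of that behaviour, here is a minimal PySpark sketch (assuming Spark 2.3+; the events table name, the creationdate partition column, and the input path are made up for the example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# With dynamic partition overwrite, an "overwrite" only replaces the
# partitions present in the incoming DataFrame; other partitions are kept.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# "events" is a hypothetical Hive table already partitioned by creationdate;
# the DataFrame's columns must match the table schema, partition column last.
df = spark.read.csv("/data/2023-01-01.csv", header=True, inferSchema=True)
df.write.mode("overwrite").insertInto("events")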
Partition in memory: you can partition or repartition the DataFrame by calling the repartition() or coalesce() transformations.
Partition on disk: while writing the PySpark DataFrame back to disk, you can choose how to partition the data based on columns using partitionBy() of pyspark.sql.DataFrameWriter.
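A small sketch of the two, with made-up column and path names:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("/data/input", header=True, inferSchema=True)

# Partition in memory: repartition() controls how many in-memory partitions
# the data is spread across (here, 8 partitions hashed on creationdate).
df_mem = df.repartition(8, "creationdate")

# Partition on disk: partitionBy() writes one sub-directory per distinct
# creationdate value under the output path.
df_mem.write.partitionBy("creationdate").parquet("/tmp/events_by_creationdate")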
I believe it works something like this:
df is a dataframe with year, month and other columns
df.write.partitionBy('year', 'month').saveAsTable(...)
or
df.write.partitionBy('year', 'month').insertInto(...)
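For example, a hedged end-to-end sketch in PySpark (the database, table, and path names are illustrative, not from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Assume the CSV contains year and month columns alongside the data columns.
df = spark.read.csv("/data/input", header=True, inferSchema=True)

# Creates a Hive table partitioned on (year, month); one partition directory
# is written per distinct combination found in the DataFrame.
df.write.partitionBy("year", "month").saveAsTable("mydb.events")

# insertInto() targets an already-existing partitioned table instead; the
# DataFrame's column order must match the table schema, partition columns last.
# df.write.insertInto("mydb.events")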
I was able to write to a partitioned Hive table using
df.write().mode(SaveMode.Append).partitionBy("colname").saveAsTable("Table")
I had to enable the following properties to make it work.
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
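A rough PySpark equivalent of the above, under the same two settings (the column and table names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The same two Hive settings, applied through the session configuration.
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

df = spark.read.csv("/data/input", header=True, inferSchema=True)

# Appends rows to the table, creating any missing "colname" partitions.
df.write.mode("append").partitionBy("colname").saveAsTable("Table")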