I have a sample application that reads CSV files into a dataframe. The dataframe can be stored to a Hive table in parquet format using the method
df.saveAsTable(tablename, mode)
The above code works fine, but I have so much data for each day that I want to dynamically partition the Hive table based on creationdate (a column in the table).
Is there any way to dynamically partition the dataframe and store it in the Hive warehouse? I want to refrain from hard-coding the insert statement using hivesqlcontext.sql(insert into table partition by(date)....).
This question can be considered an extension of: How to save DataFrame directly to Hive?
Any help is much appreciated.
From Writing Into Dynamic Partitions Using Spark: Spark now writes data partitioned just as Hive would, which means only the partitions that are touched by the INSERT query get overwritten and the others are not touched.
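As a rough illustration of that behaviour, here is a minimal PySpark sketch (assuming Spark 2.3+; the events table name, the creationdate partition column, and the input path are made up for the example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# With dynamic partition overwrite, an "overwrite" only replaces the
# partitions present in the incoming DataFrame; other partitions are kept.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# "events" is a hypothetical Hive table already partitioned by creationdate;
# the DataFrame's columns must match the table schema, partition column last.
df = spark.read.csv("/data/2023-01-01.csv", header=True, inferSchema=True)
df.write.mode("overwrite").insertInto("events")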
Partition in memory: you can partition or repartition the DataFrame by calling the repartition() or coalesce() transformations.
Partition on disk: while writing the PySpark DataFrame back to disk, you can choose how to partition the data based on columns using partitionBy() of pyspark.sql.DataFrameWriter.
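A small sketch of the two, with made-up column and path names:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("/data/input", header=True, inferSchema=True)

# Partition in memory: repartition() controls how many in-memory partitions
# the data is spread across (here, 8 partitions hashed on creationdate).
df_mem = df.repartition(8, "creationdate")

# Partition on disk: partitionBy() writes one sub-directory per distinct
# creationdate value under the output path.
df_mem.write.partitionBy("creationdate").parquet("/tmp/events_by_creationdate")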
I believe it works something like this:
df is a dataframe with year, month and other columns
df.write.partitionBy('year', 'month').saveAsTable(...)
or
df.write.partitionBy('year', 'month').insertInto(...)
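For example, a hedged end-to-end sketch in PySpark (the database, table, and path names are illustrative, not from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Assume the CSV contains year and month columns alongside the data columns.
df = spark.read.csv("/data/input", header=True, inferSchema=True)

# Creates a Hive table partitioned on (year, month); one partition directory
# is written per distinct combination found in the DataFrame.
df.write.partitionBy("year", "month").saveAsTable("mydb.events")

# insertInto() targets an already-existing partitioned table instead; the
# DataFrame's column order must match the table schema, partition columns last.
# df.write.insertInto("mydb.events")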
I was able to write to a partitioned Hive table using
df.write().mode(SaveMode.Append).partitionBy("colname").saveAsTable("Table")
I had to enable the following properties to make it work.
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
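A rough PySpark equivalent of the above, under the same two settings (the column and table names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The same two Hive settings, applied through the session configuration.
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

df = spark.read.csv("/data/input", header=True, inferSchema=True)

# Appends rows to the table, creating any missing "colname" partitions.
df.write.mode("append").partitionBy("colname").saveAsTable("Table")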