I am using Spark 2.0 and I was wondering: is it possible to list all the files of a specific Hive table? If so, I could incrementally update those files directly from Spark with sc.textFile("file.orc").
How can I add a new partition to a Hive table? Is there any API on the Hive metastore that I can use from Spark?
Is there any way to get the internal Hive function that maps a DataFrame row => partition_path?
My main motivation is incremental updates for a table. Right now the only way I have figured out is FULL OUTER JOIN SQL + SaveMode.Overwrite, which is not efficient because it overwrites the whole table, while my main interest is incremental updates of specific partitions / adding new partitions.
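To make that concrete, here is a rough sketch of what I am doing today (id, value and month are placeholder columns, and the paths are made up):

import org.apache.spark.sql.functions.coalesce
import spark.implicits._

val current = spark.table("my_table")            // scans the entire table
val updates = spark.read.orc("/data/updates")    // the incremental delta

// full outer join on the key, preferring the updated value when one exists
val merged = current.as("c")
  .join(updates.as("u"), Seq("id"), "full_outer")
  .select($"id",
          coalesce($"u.value", $"c.value").as("value"),
          coalesce($"u.month", $"c.month").as("month"))

// then: merged.write.mode(SaveMode.Overwrite).partitionBy("month").saveAsTable("my_table")
// (via a staging table in practice, since Spark refuses to overwrite a table it is reading from)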
EDIT
From what I have seen on HDFS, with SaveMode.Overwrite Spark emits the table definition, i.e. CREATE TABLE my_table .... PARTITIONED BY (month, ..), but puts all the files under $HIVE/my_table and not under $HIVE/my_table/month/..., which means it is not partitioning the data. When I wrote df.write.partitionBy(...).mode(Overwrite).saveAsTable("my_table") instead, I saw on HDFS that the layout is correct.
I have used SaveMode.Overwrite because I am updating records, not appending data.
I load data using spark.table("my_table"), which means Spark lazily loads the table; that is a problem since I don't want to load the whole table, just part of it.
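Ideally I would read only the partitions I need, something like:

import spark.implicits._

// I would like this to read only the files under month=1, not the whole table
val january = spark.table("my_table").where($"month" === 1)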
My questions:
1. Is Spark going to shuffle the data because I used partitionBy(), or does it compare the current partitioning and skip the shuffle if it is the same?
2. Is Spark smart enough to use partition pruning when mutating only part of the data, i.e. just a specific month/year, and apply that change instead of loading all the data? (FULL OUTER JOIN is basically an operation that scans the whole table.)
Apache Hive organizes tables into partitions. Partitioning is a way of dividing a table into related parts based on the values of particular columns such as date, city, or department. Each table in Hive can have one or more partition keys to identify a particular partition.
You can run the HDFS list command to show all partition folders of a table from the Hive data warehouse location.
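For example, from Spark itself you can list those folders with the Hadoop FileSystem API (a minimal sketch; the warehouse path below is an assumption, check hive.metastore.warehouse.dir for your setup):

import org.apache.hadoop.fs.{FileSystem, Path}

// list the partition folders of a table under the warehouse location
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path("/user/hive/warehouse/my_table"))
  .filter(_.isDirectory)
  .foreach(status => println(status.getPath))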
Adding partitions:
Adding a partition from Spark can be done with partitionBy, provided by DataFrameWriter for non-streamed data or by DataStreamWriter for streamed data.
public DataFrameWriter<T> partitionBy(scala.collection.Seq<String> colNames)
So if you want to partition the data by year and month, Spark will save the data to folders like:
year=2016/month=01/
year=2016/month=02/
You have mentioned ORC - you can save in the ORC format with:
df.write.partitionBy("year", "month").format("orc").save(path)
but you can just as easily insert into an existing Hive table like:
df.write.partitionBy("year", "month").insertInto("my_table")
(note that insertInto writes into the partitioning already defined on the target table, so on some Spark versions combining it with partitionBy raises an AnalysisException and the partitionBy call should simply be dropped)
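Alternatively, a new partition can be registered explicitly through SQL, since HiveQL DDL is supported once Hive support is enabled (the table name and partition values here are only an example):

// adds the partition to the metastore; add a LOCATION clause to point at an existing folder
spark.sql("ALTER TABLE my_table ADD IF NOT EXISTS PARTITION (year=2016, month=1)")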
Getting all partitions:
Spark SQL supports the Hive query language, so you can use SHOW PARTITIONS to get the list of partitions of a specific table.
sparkSession.sql("SHOW PARTITIONS partitionedHiveTable")
Just make sure you have .enableHiveSupport() when you are creating the session with SparkSession.Builder, and also make sure you have hive-site.xml etc. configured properly.
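For example:

import org.apache.spark.sql.SparkSession

// Hive support is required for HiveQL statements such as SHOW PARTITIONS
val sparkSession = SparkSession.builder()
  .appName("hive-partitions")
  .enableHiveSupport()
  .getOrCreate()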