I am using Spark 2.0 and I was wondering: is it possible to list all the files of a specific Hive table? If so, I could incrementally update those files directly from Spark with sc.textFile("file.orc").
How can I add a new partition to a Hive table? Is there any API on the Hive metastore that I can use from Spark?
Is there any way to get the internal Hive function that maps a DataFrame row => partition_path?
My main motivation is incremental updates for a table. Right now the only way I have figured out is FULL OUTER JOIN SQL + SaveMode.Overwrite, which is not efficient because it overwrites the whole table, while my main interest is incremental updates of specific partitions / adding new partitions.
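To make that concrete, here is a rough sketch of what I am doing today (id, value and month are placeholder columns, and the paths are made up):

import org.apache.spark.sql.functions.coalesce
import spark.implicits._

val current = spark.table("my_table")            // scans the entire table
val updates = spark.read.orc("/data/updates")    // the incremental delta

// full outer join on the key, preferring the updated value when one exists
val merged = current.as("c")
  .join(updates.as("u"), Seq("id"), "full_outer")
  .select($"id",
          coalesce($"u.value", $"c.value").as("value"),
          coalesce($"u.month", $"c.month").as("month"))

// then: merged.write.mode(SaveMode.Overwrite).partitionBy("month").saveAsTable("my_table")
// (via a staging table in practice, since Spark refuses to overwrite a table it is reading from)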
EDIT
From what I have seen on HDFS, with SaveMode.Overwrite Spark emits the table definition, i.e. CREATE TABLE my_table .... PARTITIONED BY (month, ..), but puts all the files under $HIVE/my_table and not under $HIVE/my_table/month/..., which means it is not partitioning the data. When I wrote df.write.partitionBy(...).mode(Overwrite).saveAsTable("my_table") instead, I saw on HDFS that the layout is correct.
I have used SaveMode.Overwrite because I am updating records, not appending data.
I load data using spark.table("my_table"), which means Spark lazily loads the table; that is a problem since I don't want to load the whole table, just part of it.
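Ideally I would read only the partitions I need, something like:

import spark.implicits._

// I would like this to read only the files under month=1, not the whole table
val january = spark.table("my_table").where($"month" === 1)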
My questions:
1. Is Spark going to shuffle the data because I used partitionBy(), or does it compare the current partitioning and skip the shuffle if it is the same?
2. Is Spark smart enough to use partition pruning when mutating only part of the data, i.e. just a specific month/year, and apply that change instead of loading all the data? (FULL OUTER JOIN is basically an operation that scans the whole table.)
Apache Hive organizes tables into partitions. Partitioning is a way of dividing a table into related parts based on the values of particular columns such as date, city, or department. Each table in Hive can have one or more partition keys to identify a particular partition.
You can run the HDFS list command to show all partition folders of a table from the Hive data warehouse location.
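For example, from Spark itself you can list those folders with the Hadoop FileSystem API (a minimal sketch; the warehouse path below is an assumption, check hive.metastore.warehouse.dir for your setup):

import org.apache.hadoop.fs.{FileSystem, Path}

// list the partition folders of a table under the warehouse location
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path("/user/hive/warehouse/my_table"))
  .filter(_.isDirectory)
  .foreach(status => println(status.getPath))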
Adding partitions:
Adding a partition from Spark can be done with partitionBy, provided by DataFrameWriter for non-streamed data or by DataStreamWriter for streamed data.
public DataFrameWriter<T> partitionBy(scala.collection.Seq<String> colNames)
So if you want to partition the data by year and month, Spark will save the data to folders like:
year=2016/month=01/
year=2016/month=02/
You have mentioned ORC - you can save in the ORC format with:
df.write.partitionBy("year", "month").format("orc").save(path)
but you can just as easily insert into an existing Hive table like:
df.write.partitionBy("year", "month").insertInto("my_table")
(note that insertInto writes into the partitioning already defined on the target table, so on some Spark versions combining it with partitionBy raises an AnalysisException and the partitionBy call should simply be dropped)
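Alternatively, a new partition can be registered explicitly through SQL, since HiveQL DDL is supported once Hive support is enabled (the table name and partition values here are only an example):

// adds the partition to the metastore; add a LOCATION clause to point at an existing folder
spark.sql("ALTER TABLE my_table ADD IF NOT EXISTS PARTITION (year=2016, month=1)")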
Getting all partitions:
Spark SQL supports the Hive query language, so you can use SHOW PARTITIONS to get the list of partitions of a specific table.
sparkSession.sql("SHOW PARTITIONS partitionedHiveTable")
Just make sure you have .enableHiveSupport() when you are creating the session with SparkSession.Builder, and also make sure you have hive-site.xml etc. configured properly.
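For example:

import org.apache.spark.sql.SparkSession

// Hive support is required for HiveQL statements such as SHOW PARTITIONS
val sparkSession = SparkSession.builder()
  .appName("hive-partitions")
  .enableHiveSupport()
  .getOrCreate()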