From Spark 1.6 onwards, as per the official documentation, we cannot load specific Hive partitions into a DataFrame the way we used to.
Until Spark 1.5, the following used to work, and the DataFrame would contain the entity column along with the data, as shown below:
DataFrame df = hiveContext.read().format("orc").load("path/to/table/entity=xyz");
However, this would not work in Spark 1.6.
If I give only the base path, as follows, the result does not contain the entity column, which I want in the DataFrame:
DataFrame df = hiveContext.read().format("orc").load("path/to/table/");
How do I load a specific Hive partition into a DataFrame? What was the rationale behind removing this feature? I believe it was efficient. Is there an alternative way to achieve this in Spark 1.6?
As per my understanding, Spark 1.6 loads all the partitions, and filtering afterwards for the specific partitions I want is not efficient: it hits memory limits and throws GC (garbage collection) errors, because thousands of partitions get loaded into memory instead of just the one that is needed.
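For illustration, the filter-based workaround I am referring to would look like this (a sketch, using the entity column from the example above):
DataFrame all = hiveContext.read().format("orc").load("path/to/table/");
// Filtering after the fact still discovers every partition under the base path first.
DataFrame xyz = all.filter(all.col("entity").equalTo("xyz"));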
The ideal choice is to use a column such as state as the partitioning column, since partitioning creates a distinct folder for each distinct value. The number of folders then equals the number of states, so less partition metadata has to be stored on the NameNode.
Hive's ALTER TABLE command is used to update or drop a partition in the Hive Metastore and at the HDFS location (for a managed table). You can also manually add or drop a Hive partition directly on HDFS using Hadoop commands; if you do so, you need to run the MSCK REPAIR TABLE command to sync the HDFS files back up with the Hive Metastore.
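For reference, a minimal sketch of running those maintenance commands through the hiveContext (my_table and the entity partition column are placeholder names):
// Drop a partition through the metastore (for a managed table this also removes the HDFS data).
hiveContext.sql("ALTER TABLE my_table DROP IF EXISTS PARTITION (entity='xyz')");
// After manually adding or removing partition directories on HDFS, re-sync the metastore.
hiveContext.sql("MSCK REPAIR TABLE my_table");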
As a rule of thumb, the lower bound for the number of Spark partitions is 2 × the number of cores available to the application. For the upper bound, each task should take at least 100 ms to execute; shorter tasks mean the partitioning is too fine and scheduling overhead starts to dominate.
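As a sketch of that rule of thumb in code (assuming an existing JavaSparkContext sc and a DataFrame df):
// Lower-bound heuristic: roughly 2x the cores available to the application.
int targetPartitions = 2 * sc.defaultParallelism();
DataFrame repartitioned = df.repartition(targetPartitions);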
Below is my explanation of how you can do it.
From Hive/Beeline:
ALTER TABLE TableName PARTITION (PartitionCol='2018-12-31')
RENAME TO PARTITION (PartitionCol='2017-12-31');
From Spark code, you basically have to initialize the hiveContext and run the same HQL from it, as shown below.
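A minimal sketch of the Spark side (assuming an existing SparkContext sc; the table and column names are carried over from the HQL above):
import org.apache.spark.sql.hive.HiveContext;

HiveContext hiveContext = new HiveContext(sc);
// Run the same HQL through the hiveContext.
hiveContext.sql("ALTER TABLE TableName PARTITION (PartitionCol='2018-12-31') "
    + "RENAME TO PARTITION (PartitionCol='2017-12-31')");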
To load a specific partition into a DataFrame using Spark 1.6, we have to first set the basePath option and then give the path of the partition that needs to be loaded:
DataFrame df = hiveContext.read().format("orc")
    .option("basePath", "path/to/table/")
    .load("path/to/table/entity=xyz");
The code above will load only the specific partition into the DataFrame, and the entity partition column will be present in it.
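A quick way to verify that the partition column made it into the DataFrame (sketch):
df.printSchema();                      // schema should now include the entity column
df.select("entity").distinct().show(); // should show only the value "xyz"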