I am trying to parse out the Location from a Hive partitioned table in Spark using this query:
val dsc_table = spark.sql("DESCRIBE FORMATTED data_db.part_table")
I was not able to find any query or any other way in Spark to specifically select the Location column from this output.
The df.inputFiles method in the DataFrame API returns the file paths backing the table. It is a best-effort snapshot of the files that compose this DataFrame.
spark.read.table("DB.TableName").inputFiles
Array[String] = Array(hdfs://test/warehouse/tablename)
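Note that on a partitioned table, inputFiles typically lists the leaf data files under each partition directory rather than the table root. A minimal sketch of trimming those paths back to the root, assuming Hive-style key=value partition directories (the paths below are illustrative, not from the original answer):

// e.g. hdfs://nn:8020/location/part_table/dt=2020-01-01/part-00000-...
val files = spark.read.table("data_db.part_table").inputFiles
// drop everything from the first key=value partition directory onward
val tableRoot = files.headOption
  .map(_.split("/").takeWhile(!_.contains("=")).mkString("/"))
// Option[String] = Some(hdfs://nn:8020/location/part_table)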
You can also use the .toDF method on the desc formatted table output and then filter the resulting DataFrame.
DataFrame API:
scala> :paste
spark.sql("desc formatted data_db.part_table")
.toDF //convert to dataframe will have 3 columns col_name,data_type,comment
.filter('col_name === "Location") //filter on colname
.collect()(0)(1)
.toString
Result:
String = hdfs://nn:8020/location/part_table
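If the Location row could ever be missing, a slightly safer variant of the same lookup uses headOption instead of indexing (my own variant, not part of the original answer):

spark.sql("desc formatted data_db.part_table")
  .filter('col_name === "Location") // keep only the Location row
  .collect()
  .headOption                       // None instead of an exception if absent
  .map(_.getString(1))              // the path sits in the data_type column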
(or)
RDD API:
scala> :paste
spark.sql("desc formatted data_db.part_table")
.collect()
.filter(r => r(0).equals("Location")) //filter on r(0) value
.map(r => r(1)) //get only the location
.mkString //convert as string
.split("8020")(1) //change the split based on your namenode port..etc
Result:
String = /location/part_table
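If you would rather avoid string surgery on the port number, the location can also be read from the catalog metadata. This is a sketch against spark.sessionState.catalog, an internal API whose shape can vary between Spark versions, so treat it as an assumption to verify on your build:

import org.apache.spark.sql.catalyst.TableIdentifier

val tableMeta = spark.sessionState.catalog
  .getTableMetadata(TableIdentifier("part_table", Some("data_db")))
// CatalogTable exposes the storage location as a java.net.URI
val location = tableMeta.location.toString
// e.g. hdfs://nn:8020/location/part_table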