Getting HDFS Location of Hive Table in Spark

I am trying to parse out the Location of a Hive partitioned table in Spark using this query:

val dsc_table = spark.sql("DESCRIBE FORMATTED data_db.part_table")

I was not able to find any query, or any other way in Spark, to specifically select the Location column from this output.

asked Mar 03 '23 by Vin

2 Answers

The df.inputFiles method in the DataFrame API returns the file paths. It gives a best-effort snapshot of the files that compose this DataFrame.

spark.read.table("DB.TableName").inputFiles
Array[String] = Array(hdfs://test/warehouse/tablename)
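
Note that for a partitioned table, inputFiles typically returns one entry per data file under each partition directory rather than the table root. A minimal sketch of deriving the root from those paths, assuming a hypothetical single partition column named dt:

val files = spark.read.table("data_db.part_table").inputFiles
// e.g. Array("hdfs://nn:8020/location/part_table/dt=2023-03-01/part-00000", ...)

// Strip everything from the first partition directory onwards to recover the table root
// (the "dt" partition column name here is an assumption for illustration).
val tableRoot = files.headOption.map(_.split("/dt=")(0))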
answered Mar 12 '23 by SantoshK

You can also call the .toDF method on the desc formatted table output and then filter the resulting DataFrame.

DataFrame API:

scala> :paste
spark.sql("desc formatted data_db.part_table")
.toDF //convert to a DataFrame with 3 columns: col_name, data_type, comment
.filter('col_name === "Location") //filter on col_name
.collect()(0)(1)
.toString

Result:

String = hdfs://nn:8020/location/part_table

(or)

RDD API:

scala> :paste
spark.sql("desc formatted data_db.part_table")
.collect()
.filter(r => r(0).equals("Location")) //filter on the r(0) value
.map(r => r(1)) //keep only the location
.mkString //convert to a string
.split("8020")(1) //adjust the split based on your namenode port, etc.

Result:

String = /location/part_table
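
If you would rather avoid parsing DESCRIBE output or splitting on the port, the location can also be read from the catalog metadata. A minimal sketch, assuming access to spark.sessionState (an internal/unstable API that may change between Spark versions):

import org.apache.spark.sql.catalyst.TableIdentifier

// Fetch the table metadata from the session catalog; .location holds the table's URI.
val location = spark.sessionState.catalog
  .getTableMetadata(TableIdentifier("part_table", Some("data_db")))
  .location

// location: java.net.URI = hdfs://nn:8020/location/part_table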
answered Mar 12 '23 by notNull