 

How to show the schema (including types) of a Parquet file from the command line or spark-shell?

I have figured out how to use spark-shell to show the field names, but the output is ugly and does not include the types:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

println(sqlContext.parquetFile(path))

prints:

ParquetTableScan [cust_id#114,blar_field#115,blar_field2#116], (ParquetRelation /blar/blar), None
asked Mar 06 '15 by samthebest

2 Answers

You should be able to do this:

sqlContext.read.parquet(path).printSchema()

From Spark docs:

// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
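If you need the schema as a value rather than just printed output, a spark-shell session along these lines should also work (a sketch, assuming Spark 1.3+, where `DataFrame.schema` returns a `StructType` and `treeString` is available):

```scala
// In spark-shell, where sqlContext is already defined.
// df.schema is a StructType carrying name, type and nullability
// for every column.
val df = sqlContext.read.parquet(path)
val schema = df.schema

// The same tree that printSchema() prints, but as a String:
println(schema.treeString)

// Or inspect individual fields programmatically:
schema.fields.foreach { f =>
  println(s"${f.name}: ${f.dataType.simpleString} (nullable = ${f.nullable})")
}
```

This is handy when you want to save the schema, compare it between files, or filter on specific field types rather than eyeballing console output.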
answered Sep 18 '22 by BAR

OK, I think I have a workable way of doing it: just peek at the first row to infer the schema. (Though I'm not sure how elegant this is, and what if the file happens to be empty? I'm sure there has to be a better solution.)

sqlContext.parquetFile(p).first()

Somewhere in its output it prints:

{
  optional binary cust_id;
  optional binary blar;
  optional double foo;
}
 fileSchema: message schema {
  optional binary cust_id;
  optional binary blar;
  optional double foo;
}
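Regarding the empty-file worry above: Parquet stores the schema in the file's footer metadata, so Spark can report it without materializing any rows. A spark-shell sketch (assuming the DataFrame-era API, Spark 1.3+) that avoids `first()` entirely:

```scala
// Works even when the file contains zero rows, because the schema
// comes from the Parquet footer, not from reading the data itself.
val df = sqlContext.read.parquet(p)
df.printSchema()          // tree view, with types and nullability
println(df.schema.json)   // the same schema serialized as JSON
```

So there is no need to peek at a row at all; `printSchema()` (or `df.schema`) gives the full typed schema even for an empty file.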
answered Sep 19 '22 by samthebest