 

How to show the schema (including types) of a Parquet file from the command line or spark-shell?

I have figured out how to use spark-shell to show the field names, but the output is ugly and does not include the types:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

println(sqlContext.parquetFile(path))

prints:

ParquetTableScan [cust_id#114,blar_field#115,blar_field2#116], (ParquetRelation /blar/blar), None
asked Mar 06 '15 by samthebest

2 Answers

You should be able to do this:

sqlContext.read.parquet(path).printSchema()

From Spark docs:

// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
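If you need the schema as a value rather than just printed output, a spark-shell session along these lines should also work (a sketch, assuming Spark 1.3+, where `DataFrame.schema` returns a `StructType` and `treeString` is available):

```scala
// In spark-shell, where sqlContext is already defined.
// df.schema is a StructType carrying name, type and nullability
// for every column.
val df = sqlContext.read.parquet(path)
val schema = df.schema

// The same tree that printSchema() prints, but as a String:
println(schema.treeString)

// Or inspect individual fields programmatically:
schema.fields.foreach { f =>
  println(s"${f.name}: ${f.dataType.simpleString} (nullable = ${f.nullable})")
}
```

This is handy when you want to save the schema, compare it between files, or filter on specific field types rather than eyeballing console output.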
answered Sep 18 '22 by BAR

OK, I think I have a workable way of doing it: just peek at the first row to infer the schema. (Though I'm not sure how elegant this is, and what if the file happens to be empty? I'm sure there has to be a better solution.)

sqlContext.parquetFile(p).first()

Somewhere in its output it prints:

{
  optional binary cust_id;
  optional binary blar;
  optional double foo;
}
 fileSchema: message schema {
  optional binary cust_id;
  optional binary blar;
  optional double foo;
}
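Regarding the empty-file worry above: Parquet stores the schema in the file's footer metadata, so Spark can report it without materializing any rows. A spark-shell sketch (assuming the DataFrame-era API, Spark 1.3+) that avoids `first()` entirely:

```scala
// Works even when the file contains zero rows, because the schema
// comes from the Parquet footer, not from reading the data itself.
val df = sqlContext.read.parquet(p)
df.printSchema()          // tree view, with types and nullability
println(df.schema.json)   // the same schema serialized as JSON
```

So there is no need to peek at a row at all; `printSchema()` (or `df.schema`) gives the full typed schema even for an empty file.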
answered Sep 19 '22 by samthebest