I have figured out how to use spark-shell to show the field names, but it's ugly and does not include the types:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// printing the DataFrame shows its query plan, not its schema
println(sqlContext.parquetFile(path))
prints:
ParquetTableScan [cust_id#114,blar_field#115,blar_field2#116], (ParquetRelation /blar/blar), None
You should be able to do this:
sqlContext.read.parquet(path).printSchema()
From the Spark docs:
// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
OK, I think I have an acceptable way of doing it: just peek at the first row to infer the schema. (Though I'm not sure how elegant this is; what if the file happens to be empty? I'm sure there has to be a better solution.)
sqlContext.parquetFile(p).first()
At some point this prints:
{
  optional binary cust_id;
  optional binary blar;
  optional double foo;
}
fileSchema: message schema {
  optional binary cust_id;
  optional binary blar;
  optional double foo;
}