I'm trying to store a DataFrame in a persistent Hive table in Spark 1.3.0 (PySpark). This is my code:
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="HiveTest")
hc = HiveContext(sc)
peopleRDD = sc.parallelize(['{"name":"Yin","age":30}'])
peopleDF = hc.jsonRDD(peopleRDD)
peopleDF.printSchema()
# root
#  |-- age: long (nullable = true)
#  |-- name: string (nullable = true)
peopleDF.saveAsTable("peopleHive")
The Hive output table I expect is:
Column    Data Type    Comments
age       long         from deserializer
name      string       from deserializer
But the actual output Hive table of the above code is:
Column    Data Type        Comments
col       array<string>    from deserializer
Why doesn't the Hive table have the same schema as the DataFrame, and how can I achieve the expected output?
Spark SQL currently uses Parquet as the default data source format for saveAsTable (controlled by spark.sql.sources.default), not the Hive SerDe format.
spark.sql.hive.convertMetastoreParquet controls whether to use the built-in Parquet reader and writer for Hive tables with the parquet storage format (instead of Hive SerDe).
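As a minimal sketch of how these settings can be applied (assuming the hc HiveContext and peopleDF from the question; both configuration keys are the ones named above):

# Be explicit about the data source format instead of relying on the default;
# saveAsTable accepts a source argument in Spark 1.3.
peopleDF.saveAsTable("peopleHive", source="parquet")

# Let Spark use its built-in Parquet reader/writer for Hive parquet tables.
hc.setConf("spark.sql.hive.convertMetastoreParquet", "true")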
It's not that the schema is wrong. Hive is unable to correctly read a table created by Spark, because it doesn't yet have the right Parquet SerDe.
If you run hc.sql('desc peopleHive').show(), it should show the correct schema.
Or you can use the spark-sql client instead of Hive. You can also use the CREATE TABLE syntax to create external tables, which works just like in Hive, and Spark has much better support for Parquet.
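Here is a minimal sketch of that workaround: write the DataFrame out as Parquet files, then register an external table over them so Hive-compatible clients can read the data. The path /tmp/people and the table name people are hypothetical, used only for illustration:

# Write the DataFrame as Parquet files to a location of your choosing
# ("/tmp/people" is a hypothetical path).
peopleDF.saveAsParquetFile("/tmp/people")

# Register an external table over those files; the schema matches the
# DataFrame's (age: long -> BIGINT, name: string -> STRING).
hc.sql("""
    CREATE EXTERNAL TABLE people (age BIGINT, name STRING)
    STORED AS PARQUET
    LOCATION '/tmp/people'
""")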