
Spark SQL HiveContext - saveAsTable creates wrong schema

I am trying to save a DataFrame to a persistent Hive table in Spark 1.3.0 (PySpark). This is my code:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="HiveTest")
hc = HiveContext(sc)

# Build a DataFrame from an RDD of JSON strings
peopleRDD = sc.parallelize(['{"name":"Yin","age":30}'])
peopleDF = hc.jsonRDD(peopleRDD)
peopleDF.printSchema()
# root
#  |-- age: long (nullable = true)
#  |-- name: string (nullable = true)

# Persist the DataFrame as a Hive table
peopleDF.saveAsTable("peopleHive")

The Hive output table I expect is:

Column  Data Type   Comments
age     long        from deserializer
name    string      from deserializer

But the actual output Hive table of the above code is:

Column  Data Type       Comments
col     array<string>   from deserializer

Why doesn't the Hive table have the same schema as the DataFrame, and how can I achieve the expected output?

asked May 14 '15 by Mirko


1 Answer

It's not that the schema is wrong. Hive cannot correctly read a table created by Spark because it doesn't yet ship the right Parquet SerDe. If you run sqlCtx.sql('desc peopleHive').show(), it will show the correct schema. Alternatively, query the table from the spark-sql client instead of the Hive CLI. You can also use CREATE EXTERNAL TABLE syntax to create external tables, which works just like in Hive, but Spark has much better Parquet support.
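For illustration, a minimal PySpark sketch of both suggestions follows, assuming hc and peopleDF are the HiveContext and DataFrame from the question; the /tmp/people_parquet path and the people_ext table name are hypothetical, chosen only for this example:

# 1) Check the schema Spark itself recorded for the table.
#    This should match the DataFrame's schema, even though the
#    Hive CLI shows a single array<string> column.
hc.sql("DESC peopleHive").show()

# 2) Alternative: write the data out as Parquet files, then
#    register an external table over them with an explicit schema.
peopleDF.saveAsParquetFile("/tmp/people_parquet")
hc.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS people_ext (age BIGINT, name STRING)
    STORED AS PARQUET
    LOCATION '/tmp/people_parquet'
""")
hc.sql("SELECT name, age FROM people_ext").show()

Whether the Hive CLI can then read people_ext still depends on the Hive deployment having Parquet support (Hive 0.13+), but Spark reads it with its built-in Parquet reader either way.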

answered Nov 07 '22 by user3931226