I'm trying to store a DataFrame in a persistent Hive table in Spark 1.3.0 (PySpark). This is my code:
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="HiveTest")
hc = HiveContext(sc)
peopleRDD = sc.parallelize(['{"name":"Yin","age":30}'])
peopleDF = hc.jsonRDD(peopleRDD)
peopleDF.printSchema()
# root
#  |-- age: long (nullable = true)
#  |-- name: string (nullable = true)
peopleDF.saveAsTable("peopleHive")
The Hive output table I expect is:
Column    Data Type    Comments
age       long         from deserializer
name      string       from deserializer
But the actual output Hive table of the above code is:
Column    Data Type        Comments
col       array<string>    from deserializer
Why doesn't the Hive table have the same schema as the DataFrame, and how can I achieve the expected output?
Spark SQL currently uses Parquet as the default data source format for saveAsTable (controlled by spark.sql.sources.default), not the Hive SerDe format.
spark.sql.hive.convertMetastoreParquet controls whether to use the built-in Parquet reader and writer for Hive tables with the parquet storage format (instead of Hive SerDe).
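As a minimal sketch of how these settings can be applied (assuming the hc HiveContext and peopleDF from the question; both configuration keys are the ones named above):

# Be explicit about the data source format instead of relying on the default;
# saveAsTable accepts a source argument in Spark 1.3.
peopleDF.saveAsTable("peopleHive", source="parquet")

# Let Spark use its built-in Parquet reader/writer for Hive parquet tables.
hc.setConf("spark.sql.hive.convertMetastoreParquet", "true")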
It's not that the schema is wrong. Hive is unable to correctly read a table created by Spark, because it doesn't yet have the right Parquet SerDe.
If you run hc.sql('desc peopleHive').show(), it should show the correct schema.
Or you can use the spark-sql client instead of Hive. You can also use the CREATE TABLE syntax to create external tables, which works just like in Hive, and Spark has much better support for Parquet.
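Here is a minimal sketch of that workaround: write the DataFrame out as Parquet files, then register an external table over them so Hive-compatible clients can read the data. The path /tmp/people and the table name people are hypothetical, used only for illustration:

# Write the DataFrame as Parquet files to a location of your choosing
# ("/tmp/people" is a hypothetical path).
peopleDF.saveAsParquetFile("/tmp/people")

# Register an external table over those files; the schema matches the
# DataFrame's (age: long -> BIGINT, name: string -> STRING).
hc.sql("""
    CREATE EXTERNAL TABLE people (age BIGINT, name STRING)
    STORED AS PARQUET
    LOCATION '/tmp/people'
""")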