save Spark dataframe to Hive: table not readable because "parquet not a SequenceFile"

I'd like to save data in a Spark (v 1.3.0) dataframe to a Hive table using PySpark.

The documentation states:

"spark.sql.hive.convertMetastoreParquet: When set to false, Spark SQL will use the Hive SerDe for parquet tables instead of the built in support."

Looking at the Spark tutorial, it seems that this property can be set:

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)
sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")

# code to create dataframe

my_dataframe.saveAsTable("my_dataframe")

However, when I try to query the saved table in Hive it returns:

hive> select * from my_dataframe;
OK
Failed with exception java.io.IOException:java.io.IOException: 
hdfs://hadoop01.woolford.io:8020/user/hive/warehouse/my_dataframe/part-r-00001.parquet
not a SequenceFile

How do I save the table so that it's immediately readable in Hive?

932

asked Jul 17 '15 18:07

Alex Woolford

2 Answers

I've been there...
The API is kinda misleading on this one.
DataFrame.saveAsTable does not create a Hive table, but an internal Spark data source table.
It also stores something in the Hive metastore, but not what you intend.
This was pointed out on the spark-user mailing list regarding Spark 1.3.

If you wish to create a Hive table from Spark, you can use this approach:
1. Run CREATE TABLE ... through Spark SQL so that the table is registered in the Hive metastore.
2. Use DataFrame.insertInto(tableName, overwriteMode) to write the actual data (Spark 1.3).
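
The two steps above could look roughly like this in PySpark. This is only a sketch: the helper names create_table_ddl and save_to_hive are made up for illustration, the column list has to be supplied by you, and STORED AS PARQUET assumes a Hive version that supports it (0.13 or later). Note that insertInto matches columns by position, so the DDL must list them in the same order as the DataFrame.

```python
def create_table_ddl(table, columns):
    """Build a Hive CREATE TABLE statement for a Parquet-backed table.

    columns: list of (name, hive_type) pairs, in DataFrame column order.
    (Hypothetical helper, not part of the Spark API.)
    """
    cols = ", ".join("%s %s" % (name, hive_type) for name, hive_type in columns)
    return "CREATE TABLE IF NOT EXISTS %s (%s) STORED AS PARQUET" % (table, cols)


def save_to_hive(sql_context, df, table, columns, overwrite=True):
    """Create a real Hive table, then insert the DataFrame's rows into it."""
    # Step 1: register the table in the Hive metastore through Spark SQL.
    sql_context.sql(create_table_ddl(table, columns))
    # Step 2: write the rows into that table (DataFrame.insertInto, Spark 1.3).
    df.insertInto(table, overwrite)
```

With a HiveContext named sqlContext, as in the question, a call might be save_to_hive(sqlContext, my_dataframe, "my_dataframe", [("user_id", "string"), ("email", "string"), ("ts", "string")]), where the column names are borrowed from the other answer purely as an example.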

131

answered Sep 28 '22 06:09

Leet-Falcon

I hit this issue last week and was able to find a workaround.

Here's the story: Hive can see the table if I create it without partitionBy:

spark-shell>someDF.write.mode(SaveMode.Overwrite)
                  .format("parquet")
                  .saveAsTable("TBL_HIVE_IS_HAPPY")

hive> desc TBL_HIVE_IS_HAPPY;
      OK
      user_id                   string                                      
      email                     string                                      
      ts                        string

But Hive can't understand the table schema (the schema is empty) if I add partitionBy:

spark-shell>someDF.write.mode(SaveMode.Overwrite)
                  .format("parquet")
                  .partitionBy("ts")
                  .saveAsTable("TBL_HIVE_IS_NOT_HAPPY")

hive> desc TBL_HIVE_IS_NOT_HAPPY;
      # col_name                data_type               from_deserializer

[Solution]:

spark-shell>sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")
spark-shell>df.write
              .partitionBy("ts")
              .mode(SaveMode.Overwrite)
              .saveAsTable("Happy_HIVE") // Suppose this table is saved at /apps/hive/warehouse/Happy_HIVE


hive> DROP TABLE IF EXISTS Happy_HIVE;
hive> CREATE EXTERNAL TABLE Happy_HIVE (user_id string, email string)
                                       PARTITIONED BY (ts STRING)
                                       STORED AS PARQUET
                                       LOCATION '/apps/hive/warehouse/Happy_HIVE';
hive> MSCK REPAIR TABLE Happy_HIVE;

The problem is that the data source table created through the DataFrame API (partitionBy + saveAsTable) is not compatible with Hive. With spark.sql.hive.convertMetastoreParquet set to false as suggested in the docs, Spark only puts the data onto HDFS but won't create the table in Hive. You can then go into the Hive shell and manually create an external table, with the proper schema and partition definition, pointing to the data location.

I've tested this in Spark 1.6.1 and it worked for me. I hope this helps!
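
If you have several tables to expose this way, the Hive statements can be generated rather than typed by hand. The following is a rough sketch and not part of Spark or Hive: external_table_statements is a hypothetical helper, and it assumes the partition columns are exactly the ones passed to partitionBy, so that they are declared in PARTITIONED BY and left out of the main column list.

```python
def _format_columns(cols):
    # Render [(name, type), ...] as "name type, name type".
    return ", ".join("%s %s" % (name, hive_type) for name, hive_type in cols)


def external_table_statements(table, columns, partition_cols, location):
    """Return HiveQL statements that expose Spark-written Parquet files.

    columns: list of (name, hive_type) pairs for every DataFrame column.
    partition_cols: column names that were passed to partitionBy().
    location: directory where Spark wrote the table data.
    (Hypothetical helper for illustration only.)
    """
    data_cols = [c for c in columns if c[0] not in partition_cols]
    part_cols = [c for c in columns if c[0] in partition_cols]

    create = "CREATE EXTERNAL TABLE %s (%s)" % (table, _format_columns(data_cols))
    if part_cols:
        create += " PARTITIONED BY (%s)" % _format_columns(part_cols)
    create += " STORED AS PARQUET LOCATION '%s'" % location

    statements = ["DROP TABLE IF EXISTS %s" % table, create]
    if part_cols:
        # Ask Hive to discover the partition directories already on disk.
        statements.append("MSCK REPAIR TABLE %s" % table)
    return statements
```

Each returned string can be pasted into the Hive shell (with a trailing semicolon) in the order given.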

answered Sep 28 '22 08:09

Yuan Zhao

Tags:

apache-spark

apache-spark-sql

pyspark

hive
