I am trying to use the Spark Structured Streaming writeStream API to write to an external, partitioned Hive table.
CREATE EXTERNAL TABLE `XX`(
`a` string,
`b` string,
`c` string,
`happened` timestamp,
`processed` timestamp,
`d` string,
`e` string,
`f` string )
PARTITIONED BY (
`year` int, `month` int, `day` int)
CLUSTERED BY (d)
INTO 6 BUCKETS
STORED AS ORC
TBLPROPERTIES (
'orc.compress'='ZLIB',
'orc.compression.strategy'='SPEED',
'orc.create.index'='true',
'orc.encoding.strategy'='SPEED');
and in Spark code,
val hiveOrcWriter: DataStreamWriter[Row] = event_stream
.writeStream
.outputMode("append")
.format("orc")
.partitionBy("year","month","day")
//.option("compression", "zlib")
.option("path", _table_loc)
.option("checkpointLocation", _table_checkpoint)
I see that on a non-partitioned table, records are inserted into Hive. However, when using a partitioned table, the Spark job does not fail or raise exceptions, but no records are inserted into the Hive table.
Appreciate comments from anyone who has dealt with similar problems.
Edit:
Just discovered that the .orc files are indeed written to HDFS, with the correct partition directory structure: e.g. /_table_loc/_table_name/year/month/day/part-0000-0123123.c000.snappy.orc
However,
select * from `XX` limit 1; (or with where year=2018)
returns no rows.
The InputFormat and OutputFormat for the table `XX` are org.apache.hadoop.hive.ql.io.orc.OrcInputFormat and org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat respectively.
Spark Streaming receives real-time data and divides it into small batches for the execution engine, whereas Structured Streaming is built on the Spark SQL API for data stream processing. In the end, these higher-level APIs are optimized by the Catalyst optimizer and translated into RDDs for execution under the hood.
Spark SQL implements the higher-level Dataset and DataFrame APIs of Spark and adds SQL support on top of them. The libraries built on top of these are: MLlib for machine learning, GraphFrames for graph analysis, and two APIs for stream processing: Spark Streaming and Structured Streaming.
Note right away that Spark partitions ≠ Hive partitions. Both are chunks of data, but Spark partitions data so it can be processed in parallel in memory, while a Hive partition lives in storage, on disk, as part of the persisted layout.
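As a minimal sketch of that difference (df and the output path are hypothetical):
df.repartition(6)                      // 6 Spark (in-memory) partitions: parallelism only
  .write
  .partitionBy("year", "month", "day") // Hive-style directories on disk, e.g. year=2018/month=5/day=1/
  .orc("/some/table/location")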
This feature isn't provided out of the box in Structured Streaming. In batch processing you would use dataset.write.saveAsTable(table_name), but that method isn't available on a streaming Dataset.
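To make the contrast concrete, a short sketch of the two APIs (someBatchDataset is a hypothetical batch Dataset):
// Batch API: saveAsTable exists on DataFrameWriter.
someBatchDataset.write
  .mode("append")
  .format("orc")
  .partitionBy("year", "month", "day")
  .saveAsTable("XX")
// Streaming API: DataStreamWriter has no saveAsTable, so this does not compile:
// event_stream.writeStream.saveAsTable("XX")
(Spark 3.1 and later add DataStreamWriter.toTable for this purpose, but it is not available in older releases.)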
After processing and saving the data in HDFS, you can manually update the partitions (or use a script that does this on a schedule):
If you use Hive
MSCK REPAIR TABLE table_name
If you use Impala
ALTER TABLE table_name RECOVER PARTITIONS
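For example, a minimal sketch of running the repair from Spark itself (assumes a Hive-enabled SparkSession and the table name from the question), e.g. in a small job scheduled after new files land:
// Register partition directories that exist on HDFS but are not yet in the
// Hive metastore, so that SELECTs start returning rows.
spark.sql("MSCK REPAIR TABLE XX")
Alternatively, individual partitions can be registered with ALTER TABLE ... ADD PARTITION, and on Spark 2.4+ the same repair call can be issued per micro-batch from foreachBatch instead of on a schedule.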