I have a Spark job that reads data from an external Hive table, applies some transformations, and saves the result into an internal Hive table:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

val sparkConf = new SparkConf().setAppName("Bulk Merge Daily Load Job")
val sparkContext = new SparkContext(sparkConf)
val sqlContext = new HiveContext(sparkContext)

// Data ingestion
val my_df = sqlContext.sql("select * from test")

// Transformation
...
...

// Save data into Hive
my_df.write.format("orc")
  .option("orc.compress", "SNAPPY")
  .mode(SaveMode.Overwrite)
  .saveAsTable("my_internal_table")
The external table was created with this tblproperties line:
tblproperties ("skip.header.line.count"="1");
My problem is that the rows of the my_internal_table table include an additional row containing the column names.
I guess this is related to this issue:
I am using Spark 1.6.0.
Can you help me with this?
PS: I am processing large files (> 10 GB).
Thanks in advance for your response.
I ran into the same issue, but saving the table as ORC should work. Just create a new table with the same schema as your original one, but with the format set to ORC, then backfill the data from the original table into the ORC one.
When you read the ORC table from Spark, it should not bring in the header row.
Hope that helps!
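As a sketch of the approach above (the table name `test_orc` is hypothetical; adjust names and properties to your setup), the ORC copy can be created and backfilled in one HiveQL statement:

```sql
-- Hypothetical ORC copy of the external table `test`;
-- CREATE TABLE ... AS SELECT copies both the schema and the data.
CREATE TABLE test_orc
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY")
AS SELECT * FROM test;
```

Because Hive itself honors skip.header.line.count when reading the external table, the header row is dropped during the backfill, and Spark can then read the ORC table without pulling in the header.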