
Spark SQL: HiveContext doesn't ignore header

I have a Spark job that reads data from an external Hive table, applies some transformations, and re-saves the data into an internal Hive table:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

val sparkConf = new SparkConf().setAppName("Bulk Merge Daily Load Job")
val sparkContext = new SparkContext(sparkConf)
val sqlContext = new HiveContext(sparkContext)

// Data Ingestion
val my_df = sqlContext.sql("select * from test")

// Transformation 
...
...

// Save Data into Hive
my_df.write.format("orc")
  .option("orc.compress", "SNAPPY")
  .mode(SaveMode.Overwrite)
  .saveAsTable("my_internal_table")

The external table is created with this tblproperties line:

tblproperties ("skip.header.line.count"="1");
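For context, a minimal sketch of what such an external table definition might look like (the column names, delimiter, and location are hypothetical, not from the original post):

```sql
-- Hypothetical schema and path, for illustration only
CREATE EXTERNAL TABLE test (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/test'
TBLPROPERTIES ("skip.header.line.count"="1");
```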

My problem is that the rows of the my_internal_table table include an additional row containing the column names.

I guess this is related to this issue:

I am using Spark 1.6.0.

Can you help me with this:

  • Is this bug still occurring in 1.6.0?
  • Is there a simple way to avoid this?

PS: I am processing large files (> 10 GB).

Thanks in advance for your response.

asked Oct 16 '25 by Nabil

1 Answer

I ran into the same issue, but if you save the same data as ORC it should work. Just create a new table with the same schema as your original one, but with the format set to ORC, then backfill the data from the original table into the ORC one.

When you read the ORC table from Spark, it should not include the header row.
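The backfill can be sketched in HiveQL (the table name test_orc is hypothetical). One point worth noting: the copy should be run from Hive itself (e.g. the hive CLI or beeline), since Hive honors skip.header.line.count when reading the text table, while Spark 1.6's HiveContext does not:

```sql
-- Hypothetical ORC staging table; inherits the columns of the
-- original text table via CREATE TABLE ... AS SELECT
CREATE TABLE test_orc STORED AS ORC
  AS SELECT * FROM test;
```

Reading test_orc from Spark afterwards should return only data rows, because the header line was already skipped when Hive wrote the ORC files.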

Hope that helps!

answered Oct 18 '25 by suba1


