
Spark SQL: HiveContext doesn't ignore header

I have a Spark job that reads data from an external Hive table, applies some transformations, and re-saves the data into an internal Hive table:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

val sparkConf = new SparkConf().setAppName("Bulk Merge Daily Load Job")
val sparkContext = new SparkContext(sparkConf)
val sqlContext = new HiveContext(sparkContext)

// Data Ingestion
val my_df = sqlContext.sql("select * from test")

// Transformation 
...
...

// Save Data into Hive
my_df.write.format("orc")
  .option("orc.compress", "SNAPPY")
  .mode(SaveMode.Overwrite)
  .saveAsTable("my_internal_table")

The external table is created with this tblproperties line:

tblproperties ("skip.header.line.count"="1");
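For context, a minimal sketch of what such an external table definition might look like (the column names, delimiter, and location are hypothetical, not from the original post):

```sql
-- Hypothetical schema and path, for illustration only
CREATE EXTERNAL TABLE test (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/test'
TBLPROPERTIES ("skip.header.line.count"="1");
```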

My problem is that the rows of the my_internal_table table include an additional row containing the column names.

I guess this is related to this issue:

I am using Spark 1.6.0.

Can you help me with this:

  • Is this bug still occurring in 1.6.0?
  • Is there a simple way to avoid this?

PS: I am processing large files (> 10 GB).

Thanks in advance for your response.

asked Oct 16 '25 by Nabil

1 Answer

I ran into the same issue, but if you save the same data as ORC it should work. Just create a new table with the same schema as your original one, but with the format set to ORC, then backfill the data from the original table into the ORC one.

When you read the ORC table from Spark, it should not include the header row.
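The backfill can be sketched in HiveQL (the table name test_orc is hypothetical). One point worth noting: the copy should be run from Hive itself (e.g. the hive CLI or beeline), since Hive honors skip.header.line.count when reading the text table, while Spark 1.6's HiveContext does not:

```sql
-- Hypothetical ORC staging table; inherits the columns of the
-- original text table via CREATE TABLE ... AS SELECT
CREATE TABLE test_orc STORED AS ORC
  AS SELECT * FROM test;
```

Reading test_orc from Spark afterwards should return only data rows, because the header line was already skipped when Hive wrote the ORC files.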

Hope that helps!

answered Oct 18 '25 by suba1


