 

Merge multiple small files into a few larger files in Spark

I am using Hive through Spark, and my Spark code contains an INSERT INTO query against a partitioned table. The input data is 200+ GB. When Spark writes to the partitioned table, it spits out very small files (a few KB each), so the output partitioned table folder now has 5000+ tiny files. I want to merge these into a few larger files of roughly 200 MB each. I tried using the Hive merge settings, but they don't seem to work.

val result7A = hiveContext.sql("set hive.exec.dynamic.partition=true")
val result7B = hiveContext.sql("set hive.exec.dynamic.partition.mode=nonstrict")
val result7C = hiveContext.sql("SET hive.merge.size.per.task=256000000")
val result7D = hiveContext.sql("SET hive.merge.mapfiles=true")
val result7E = hiveContext.sql("SET hive.merge.mapredfiles=true")
val result7F = hiveContext.sql("SET hive.merge.sparkfiles = true")
val result7G = hiveContext.sql("set hive.aux.jars.path=c:\\Applications\\json-serde-1.1.9.3-SNAPSHOT-jar-with-dependencies.jar")

val result8 = hiveContext.sql("INSERT INTO TABLE partition_table PARTITION (date) select a,b,c from partition_json_table")

The above Hive settings work in a MapReduce Hive execution and produce files of the specified size. Is there any option to do this in Spark or Scala?

dheee asked Jun 23 '15


People also ask

How do I combine small files in Hadoop?

The Hadoop -getmerge command merges multiple files in HDFS (Hadoop Distributed File System) into one single output file in the local file system. For example, it can merge two files present in HDFS, file1.txt and file2.txt, into a single output file.
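For illustration, a minimal sketch of invoking that command from Scala via sys.process; the HDFS directory and local file paths below are hypothetical:

import scala.sys.process._

// -getmerge concatenates every file under the HDFS directory into one file
// on the local file system; both paths are made-up example values.
val hdfsDir   = "/user/hive/warehouse/partition_table/date=2015-06-23"
val localFile = "/tmp/partition_merged"
Seq("hadoop", "fs", "-getmerge", hdfsDir, localFile).!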

What is coalesce in Spark?

What is coalesce? The coalesce method reduces the number of partitions in a DataFrame. Coalesce avoids a full shuffle: instead of creating new partitions, it redistributes the data into existing partitions, which means it can only decrease the number of partitions.
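As a hedged sketch of how coalesce could be applied to the question's insert (the target of 10 partitions is an arbitrary assumption; table and column names follow the question, and the partition column date is selected last, which a dynamic partition insert requires):

// Read the source table, reduce the number of partitions without a full
// shuffle, and expose the result to SQL so the partitioned insert reuses it.
val compacted = hiveContext.sql("SELECT a, b, c, date FROM partition_json_table").coalesce(10)
compacted.registerTempTable("partition_json_coalesced")

hiveContext.sql(
  "INSERT INTO TABLE partition_table PARTITION (date) " +
  "SELECT a, b, c, date FROM partition_json_coalesced")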

Can we merge parquet files?

If you only want to combine the files from a single partition, you can copy the data to a different table, drop the old partition, then insert into the new partition to produce a single compacted partition.
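A hedged sketch of that compaction pattern run through the same hiveContext; the staging table name and the partition value '2015-06-23' are hypothetical:

// 1. Copy the rows of one partition to a staging table.
hiveContext.sql(
  "CREATE TABLE staging_table AS " +
  "SELECT a, b, c FROM partition_table WHERE date = '2015-06-23'")

// 2. Drop the old, fragmented partition.
hiveContext.sql("ALTER TABLE partition_table DROP PARTITION (date = '2015-06-23')")

// 3. Re-insert the staged rows, producing a freshly written, compacted partition.
hiveContext.sql(
  "INSERT INTO TABLE partition_table PARTITION (date = '2015-06-23') " +
  "SELECT a, b, c FROM staging_table")

// 4. Clean up the staging table.
hiveContext.sql("DROP TABLE staging_table")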


1 Answer

I had the same issue. The solution was to add a DISTRIBUTE BY clause with the partition columns. This ensures that all data for one partition goes to a single reducer. Example in your case:

INSERT INTO TABLE partition_table PARTITION (date) SELECT a, b, c FROM partition_json_table DISTRIBUTE BY date
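Run through the question's hiveContext, this could look like the sketch below; the partition column date is added to the SELECT list on the assumption that the source table has such a column, since a dynamic partition insert needs it selected last:

// DISTRIBUTE BY date routes all rows for a given date to the same reducer,
// so each partition directory is written by one task and gets fewer,
// larger files.
val result8 = hiveContext.sql(
  "INSERT INTO TABLE partition_table PARTITION (date) " +
  "SELECT a, b, c, date FROM partition_json_table DISTRIBUTE BY date")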
Jussi Kujala answered Oct 12 '22